Method and device for efficient frame erasure concealment in linear predictive based speech codecs

ABSTRACT

The present invention relates to a method and device for improving concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder (106) to a decoder (110), and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received. For that purpose, concealment/recovery parameters are determined in the encoder or decoder. When determined in the encoder (106), the concealment/recovery parameters are transmitted to the decoder (110). In the decoder, erasure frame concealment and decoder recovery are conducted in response to the concealment/recovery parameters. The concealment/recovery parameters may be selected from the group consisting of: a signal classification parameter, an energy information parameter and a phase information parameter. The determination of the concealment/recovery parameters comprises classifying the successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset, and this classification is determined on the basis of at least a part of the following parameters: a normalized correlation parameter, a spectral tilt parameter, a signal-to-noise ratio parameter, a pitch stability parameter, a relative frame energy parameter, and a zero crossing parameter.

FIELD OF THE INVENTION

The present invention relates to a technique for digitally encoding a sound signal, in particular but not exclusively a speech signal, in view of transmitting and/or synthesizing this sound signal. More specifically, the present invention relates to robust encoding and decoding of sound signals to maintain good performance in case of erased frame(s) due, for example, to channel errors in wireless systems or lost packets in voice over packet network applications.

BACKGROUND OF THE INVENTION

The demand for efficient digital narrow- and wideband speech encoding techniques with a good trade-off between the subjective quality and bit rate is increasing in various application areas such as teleconferencing, multimedia, and wireless communications. Until recently, a telephone bandwidth constrained into a range of 200-3400 Hz has mainly been used in speech coding applications. However, wideband speech applications provide increased intelligibility and naturalness in communication compared to the conventional telephone bandwidth. A bandwidth in the range of 50-7000 Hz has been found sufficient for delivering a good quality giving an impression of face-to-face communication. For general audio signals, this bandwidth gives an acceptable subjective quality, but is still lower than the quality of FM radio or CD that operate on ranges of 20-16000 Hz and 20-20000 Hz, respectively.

A speech encoder converts a speech signal into a digital bit stream which is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, that is, sampled and quantized with usually 16 bits per sample. The speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.

Code-Excited Linear Prediction (CELP) coding is one of the best available techniques for achieving a good compromise between the subjective quality and bit rate. This encoding technique is a basis of several speech encoding standards both in wireless and wireline applications. In CELP encoding, the sampled speech signal is processed in successive blocks of L samples usually called frames, where L is a predetermined number corresponding typically to 10-30 ms. A linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically needs a lookahead, a 5-15 ms speech segment from the subsequent frame. The L-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three or four resulting in 4-10 ms subframes. In each subframe, an excitation signal is usually obtained from two components, the past excitation and the innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of the LP filter.

As the main applications of low bit rate speech encoding are wireless mobile communication systems and voice over packet networks, increasing the robustness of speech codecs in case of frame erasures becomes of significant importance. In wireless cellular systems, the energy of the received signal can exhibit frequent severe fades resulting in high bit error rates and this becomes more evident at the cell boundaries. In this case the channel decoder fails to correct the errors in the received frame and as a consequence, the error detector usually used after the channel decoder will declare the frame as erased. In voice over packet network applications, the speech signal is packetized where usually a 20 ms frame is placed in each packet. In packet-switched communications, a packet dropping can occur at a router if the number of packets becomes very large, or the packet can reach the receiver after a long delay and it should be declared as lost if its delay is more than the length of a jitter buffer at the receiver side. In these systems, the codec is subjected to typically 3 to 5% frame erasure rates. Furthermore, the use of wideband speech encoding is an important asset to these systems in order to allow them to compete with traditional PSTN (public switched telephone network) that uses the legacy narrow band speech signals.

The adaptive codebook, or the pitch predictor, in CELP plays an important role in maintaining high speech quality at low bit rates. However, since the content of the adaptive codebook is based on the signal from past frames, this makes the codec model sensitive to frame loss. In case of erased or lost frames, the content of the adaptive codebook at the decoder becomes different from its content at the encoder. Thus, after a lost frame is concealed and subsequent good frames are received, the synthesized signal in the received good frames is different from the intended synthesis signal since the adaptive codebook contribution has been changed. The impact of a lost frame depends on the nature of the speech segment in which the erasure occurred. If the erasure occurs in a stationary segment of the signal then an efficient frame erasure concealment can be performed and the impact on subsequent good frames can be minimized. On the other hand, if the erasure occurs in a speech onset or a transition, the effect of the erasure can propagate through several frames. For instance, if the beginning of a voiced segment is lost, then the first pitch period will be missing from the adaptive codebook content. This will have a severe effect on the pitch predictor in subsequent good frames, resulting in a long time before the synthesis signal converges to the intended one at the encoder.

SUMMARY OF THE INVENTION

The present invention relates to a method for improving concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received, comprising:

-   determining, in the encoder, concealment/recovery parameters;
-   transmitting to the decoder the concealment/recovery parameters determined in the encoder; and
-   in the decoder, conducting erasure frame concealment and decoder recovery in response to the received concealment/recovery parameters.

The present invention also relates to a method for the concealment of frame erasure caused by frames erased during transmission of a sound signal encoded under the form of signal-encoding parameters from an encoder to a decoder, and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received, comprising:

-   determining, in the decoder, concealment/recovery parameters from the signal-encoding parameters;
-   in the decoder, conducting erased frame concealment and decoder recovery in response to the determined concealment/recovery parameters.

In accordance with the present invention, there is also provided a device for improving concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received, comprising:

-   means for determining, in the encoder, concealment/recovery parameters;
-   means for transmitting to the decoder the concealment/recovery parameters determined in the encoder; and
-   in the decoder, means for conducting erasure frame concealment and decoder recovery in response to the received concealment/recovery parameters.

According to the invention, there is further provided a device for the concealment of frame erasure caused by frames erased during transmission of a sound signal encoded under the form of signal-encoding parameters from an encoder to a decoder, and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received, comprising:

-   means for determining, in the decoder, concealment/recovery parameters from the signal-encoding parameters;
-   in the decoder, means for conducting erased frame concealment and decoder recovery in response to the determined concealment/recovery parameters.

The present invention is also concerned with a system for encoding and decoding a sound signal, and a sound signal decoder using the above defined devices for improving concealment of frame erasure caused by frames of the encoded sound signal erased during transmission from the encoder to the decoder, and for accelerating recovery of the decoder after non erased frames of the encoded sound signal have been received.

The foregoing and other objects, advantages and features of the present invention will become more apparent upon reading of the following non restrictive description of illustrative embodiments thereof, given by way of example only with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of a speech communication system illustrating an application of speech encoding and decoding devices in accordance with the present invention;

FIG. 2 is a schematic block diagram of an example of wideband encoding device (AMR-WB encoder);

FIG. 3 is a schematic block diagram of an example of wideband decoding device (AMR-WB decoder);

FIG. 4 is a simplified block diagram of the AMR-WB encoder of FIG. 2, wherein the down-sampler module, the high-pass filter module and the pre-emphasis filter module have been grouped in a single pre-processing module, and wherein the closed-loop pitch search module, the zero-input response calculator module, the impulse response generator module, the innovative excitation search module and the memory update module have been grouped in a single closed-loop pitch and innovative codebook search module;

FIG. 5 is an extension of the block diagram of FIG. 4 in which modules related to an illustrative embodiment of the present invention have been added;

FIG. 6 is a block diagram explaining the situation when an artificial onset is constructed; and

FIG. 7 is a schematic diagram showing an illustrative embodiment of a frame classification state machine for the erasure concealment.

DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS

Although the illustrative embodiments of the present invention will be described in the following description in relation to a speech signal, it should be kept in mind that the concepts of the present invention equally apply to other types of signal, in particular but not exclusively to other types of sound signals.

FIG. 1 illustrates a speech communication system 100 depicting the use of speech encoding and decoding in the context of the present invention. The speech communication system 100 of FIG. 1 supports transmission of a speech signal across a communication channel 101. Although it may comprise for example a wire, an optical link or a fiber link, the communication channel 101 typically comprises at least in part a radio frequency link. The radio frequency link often supports multiple, simultaneous speech communications requiring shared bandwidth resources such as may be found with cellular telephony systems. Although not shown, the communication channel 101 may be replaced by a storage device in a single device embodiment of the system 100 that records and stores the encoded speech signal for later playback.

In the speech communication system 100 of FIG. 1, a microphone 102 produces an analog speech signal 103 that is supplied to an analog-to-digital (A/D) converter 104 for converting it into a digital speech signal 105. A speech encoder 106 encodes the digital speech signal 105 to produce a set of signal-encoding parameters 107 that are coded into binary form and delivered to a channel encoder 108. The optional channel encoder 108 adds redundancy to the binary representation of the signal-encoding parameters 107 before transmitting them over the communication channel 101.

In the receiver, a channel decoder 109 utilizes the said redundant information in the received bit stream 111 to detect and correct channel errors that occurred during the transmission. A speech decoder 110 converts the bit stream 112 received from the channel decoder 109 back to a set of signal-encoding parameters and creates from the recovered signal-encoding parameters a digital synthesized speech signal 113. The digital synthesized speech signal 113 reconstructed at the speech decoder 110 is converted to an analog form 114 by a digital-to-analog (D/A) converter 115 and played back through a loudspeaker unit 116.

The illustrative embodiment of efficient frame erasure concealment method disclosed in the present specification can be used with either narrowband or wideband linear prediction based codecs. The present illustrative embodiment is disclosed in relation to a wideband speech codec that has been standardized by the International Telecommunication Union (ITU) as Recommendation G.722.2 and known as the AMR-WB codec (Adaptive Multi-Rate Wideband codec) [ITU-T Recommendation G.722.2 “Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)”, Geneva, 2002]. This codec has also been selected by the third generation partnership project (3GPP) for wideband telephony in third generation wireless systems [3GPP TS 26.190, “AMR Wideband Speech Codec: Transcoding Functions,” 3GPP Technical Specification]. AMR-WB can operate at 9 bit rates ranging from 6.6 to 23.85 kbit/s. The bit rate of 12.65 kbit/s is used to illustrate the present invention.

Here, it should be understood that the illustrative embodiment of efficient frame erasure concealment method could be applied to other types of codecs.

In the following sections, an overview of the AMR-WB encoder and decoder will be first given. Then, the illustrative embodiment of the novel approach to improve the robustness of the codec will be disclosed.

Overview of the AMR-WB Encoder

The sampled speech signal is encoded on a block by block basis by the encoding device 200 of FIG. 2 which is broken down into eleven modules numbered from 201 to 211.

The input speech signal 212 is therefore processed on a block-by-block basis, i.e. in the above-mentioned L-sample blocks called frames.

Referring to FIG. 2, the sampled input speech signal 212 is down-sampled in a down-sampler module 201. The signal is down-sampled from 16 kHz down to 12.8 kHz, using techniques well known to those of ordinary skill in the art. Down-sampling increases the coding efficiency, since a smaller frequency bandwidth is encoded. This also reduces the algorithmic complexity since the number of samples in a frame is decreased. After down-sampling, the 320-sample frame of 20 ms is reduced to a 256-sample frame (down-sampling ratio of 4/5).
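The frame sizes quoted above follow directly from the sampling rates and the 20 ms frame duration. The following minimal sketch merely checks that arithmetic; the function name `frame_length` is hypothetical and is not part of the codec.

```python
from fractions import Fraction

def frame_length(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of samples in one frame at the given sampling rate."""
    samples = Fraction(sample_rate_hz * frame_ms, 1000)
    if samples.denominator != 1:
        raise ValueError("frame does not contain a whole number of samples")
    return int(samples)

input_len = frame_length(16000, 20)   # frame length before down-sampling
coded_len = frame_length(12800, 20)   # frame length after down-sampling
ratio = Fraction(12800, 16000)        # down-sampling ratio

print(input_len, coded_len, ratio)
```

The ratio reduces to 4/5, which is why a 320-sample input frame maps to a 256-sample frame.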

The input frame is then supplied to the optional pre-processing module 202. Pre-processing module 202 may consist of a high-pass filter with a 50 Hz cut-off frequency. High-pass filter 202 removes the unwanted sound components below 50 Hz.

The down-sampled, pre-processed signal is denoted by s_(p)(n), n=0, 1, 2, . . . , L−1, where L is the length of the frame (256 at a sampling frequency of 12.8 kHz). In an illustrative embodiment of the preemphasis filter 203, the signal s_(p)(n) is preemphasized using a filter having the following transfer function:

$$P(z) = 1 - \mu z^{-1}$$

where μ is a preemphasis factor with a value located between 0 and 1 (a typical value is μ=0.7). The function of the preemphasis filter 203 is to enhance the high frequency contents of the input speech signal. It also reduces the dynamic range of the input speech signal, which renders it more suitable for fixed-point implementation. Preemphasis also plays an important role in achieving a proper overall perceptual weighting of the quantization error, which contributes to improved sound quality. This will be explained in more detail herein below.
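In the time domain, the transfer function P(z) corresponds to the difference equation s(n) = s_(p)(n) − μ s_(p)(n−1). The sketch below illustrates this with μ=0.7; the function name `preemphasize` and the zero initial filter state are assumptions made for illustration only.

```python
def preemphasize(signal, mu=0.7, prev_sample=0.0):
    """Apply P(z) = 1 - mu*z^-1, i.e. y(n) = x(n) - mu*x(n-1)."""
    out = []
    prev = prev_sample          # filter memory, assumed zero at start
    for x in signal:
        out.append(x - mu * prev)
        prev = x
    return out

# A constant (low-frequency) input is attenuated after the first sample,
# while an alternating (high-frequency) input is amplified.
constant = preemphasize([1.0, 1.0, 1.0, 1.0])
alternating = preemphasize([1.0, -1.0, 1.0, -1.0])
print(constant)
print(alternating)
```

The two toy inputs show the high-frequency emphasis: the steady-state gain is 1−μ for the constant input and 1+μ for the alternating one.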

The output of the preemphasis filter 203 is denoted s(n). This signal is used for performing LP analysis in module 204. LP analysis is a technique well known to those of ordinary skill in the art. In this illustrative implementation, the autocorrelation approach is used. In the autocorrelation approach, the signal s(n) is first windowed using, typically, a Hamming window having a length of the order of 30-40 ms. The autocorrelations are computed from the windowed signal, and Levinson-Durbin recursion is used to compute LP filter coefficients, a_(i), where i=1, . . . , p, and where p is the LP order, which is typically 16 in wideband coding. The parameters a_(i) are the coefficients of the transfer function A(z) of the LP filter, which is given by the following relation:

$$A(z) = 1 + \sum_{i=1}^{p} a_{i} z^{-i}$$
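The autocorrelation approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the AMR-WB reference procedure: it uses a plain symmetric Hamming window, omits any lag windowing or bandwidth expansion, and all function names are hypothetical. The sign convention matches A(z) as defined above.

```python
import math
import random

def hamming(length):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def autocorrelation(x, max_lag):
    """r[k] = sum_n x(n) x(n-k) for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for [1, a_1, ..., a_p] of A(z) = 1 + sum_i a_i z^-i.

    Returns the coefficients and the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err

def lp_analysis(segment, order=16):
    """Window a speech segment, then run the autocorrelation method."""
    window = hamming(len(segment))
    windowed = [s * w for s, w in zip(segment, window)]
    return levinson_durbin(autocorrelation(windowed, order), order)

# Known case: autocorrelations of a first-order process with coefficient 0.5
coeffs, error = levinson_durbin([1.0, 0.5, 0.25], order=2)

# Noise-like stand-in for a 30 ms segment (384 samples at 12.8 kHz)
rng = random.Random(0)
segment = [rng.uniform(-1.0, 1.0) for _ in range(384)]
a, e = lp_analysis(segment)
```

For the known first-order case, the recursion yields a_(1) = −0.5 and a_(2) = 0, as expected for a signal predicted by half of its previous sample.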

LP analysis is performed in module 204, which also performs the quantization and interpolation of the LP filter coefficients. The LP filter coefficients are first transformed into another equivalent domain more suitable for quantization and interpolation purposes. The line spectral pair (LSP) and immittance spectral pair (ISP) domains are two domains in which quantization and interpolation can be efficiently performed. The 16 LP filter coefficients, a_(i), can be quantized in the order of 30 to 50 bits using split or multi-stage quantization, or a combination thereof. The purpose of the interpolation is to enable updating the LP filter coefficients every subframe while transmitting them once every frame, which improves the encoder performance without increasing the bit rate. Quantization and interpolation of the LP filter coefficients is believed to be otherwise well known to those of ordinary skill in the art and, accordingly, will not be further described in the present specification.

The following paragraphs will describe the rest of the coding operations performed on a subframe basis. In this illustrative implementation, the input frame is divided into 4 subframes of 5 ms (64 samples at the sampling frequency of 12.8 kHz). In the following description, the filter A(z) denotes the unquantized interpolated LP filter of the subframe, and the filter Â(z) denotes the quantized interpolated LP filter of the subframe. The filter Â(z) is supplied every subframe to a multiplexer 213 for transmission through a communication channel.

In analysis-by-synthesis encoders, the optimum pitch and innovation parameters are searched by minimizing the mean squared error between the input speech signal 212 and a synthesized speech signal in a perceptually weighted domain. The weighted signal s_(w)(n) is computed in a perceptual weighting filter 205 in response to the signal s(n) from the pre-emphasis filter 203. A perceptual weighting filter 205 with fixed denominator, suited for wideband signals, is used. An example of transfer function for the perceptual weighting filter 205 is given by the following relation:

$$W(z) = \frac{A(z/\gamma_{1})}{1 - \gamma_{2} z^{-1}} \quad \text{where } 0 < \gamma_{2} < \gamma_{1} \leq 1$$
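Evaluating A(z/γ₁) amounts to scaling the i-th LP coefficient by γ₁ to the power i, and the fixed denominator adds a one-tap recursion. The sketch below illustrates this; the function names, the zero initial state and the values γ₁=0.92 and γ₂=0.68 are assumptions chosen only to satisfy 0<γ₂<γ₁≦1 and are not taken from the description above.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): the i-th coefficient is scaled by gamma**i."""
    return [coef * gamma ** i for i, coef in enumerate(a)]

def perceptual_weighting(signal, a, gamma1=0.92, gamma2=0.68):
    """Filter `signal` through W(z) = A(z/gamma1) / (1 - gamma2*z^-1).

    `a` holds [1, a_1, ..., a_p]; filter memories are assumed to be zero.
    """
    num = bandwidth_expand(a, gamma1)
    out = []
    for n in range(len(signal)):
        # Numerator (FIR) part: A(z/gamma1) applied to the input
        fir = sum(num[i] * signal[n - i] for i in range(len(num)) if n - i >= 0)
        # Denominator part: single recursive tap with coefficient gamma2
        prev = out[n - 1] if n > 0 else 0.0
        out.append(fir + gamma2 * prev)
    return out

# Impulse response of W(z) for a toy first-order LP filter A(z) = 1 - 0.5 z^-1
response = perceptual_weighting([1.0, 0.0, 0.0], [1.0, -0.5])
print(response)
```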

In order to simplify the pitch analysis, an open-loop pitch lag T_(OL) is first estimated in an open-loop pitch search module 206 from the weighted speech signal s_(w)(n). Then the closed-loop pitch analysis, which is performed in a closed-loop pitch search module 207 on a subframe basis, is restricted around the open-loop pitch lag T_(OL) which significantly reduces the search complexity of the LTP parameters T (pitch lag) and b (pitch gain). The open-loop pitch analysis is usually performed in module 206 once every 10 ms (two subframes) using techniques well known to those of ordinary skill in the art.

The target vector x for LTP (Long Term Prediction) analysis is first computed. This is usually done by subtracting the zero-input response s₀ of weighted synthesis filter W(z)/Â(z) from the weighted speech signal s_(w)(n). This zero-input response s₀ is calculated by a zero-input response calculator 208 in response to the quantized interpolated LP filter Â(z) from the LP analysis, quantization and interpolation module 204 and to the initial states of the weighted synthesis filter W(z)/Â(z) stored in memory update module 211 in response to the LP filters A(z) and Â(z), and the excitation vector u. This operation is well known to those of ordinary skill in the art and, accordingly, will not be further described.

An N-dimensional impulse response vector h of the weighted synthesis filter W(z)/Â(z) is computed in the impulse response generator 209 using the coefficients of the LP filters A(z) and Â(z) from module 204. Again, this operation is well known to those of ordinary skill in the art and, accordingly, will not be further described in the present specification.

The closed-loop pitch (or pitch codebook) parameters b, T and j are computed in the closed-loop pitch search module 207, which uses the target vector x, the impulse response vector h and the open-loop pitch lag T_(OL) as inputs.

The pitch search consists of finding the best pitch lag T and gain b that minimize a mean squared weighted pitch prediction error between the target vector x and a scaled filtered version of the past excitation, for example:

$$e^{(j)} = \left\| x - b^{(j)} y^{(j)} \right\|^{2} \quad \text{where } j = 1, 2, \ldots, k$$
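For a single candidate with filtered past excitation y, the gain that minimizes this error has a closed form. The short derivation below is standard least squares rather than a quotation from the present description (the superscript t denotes vector transpose, and the index j is dropped for readability):

$$e = \left\| x - b\,y \right\|^{2} = x^{t}x - 2b\,x^{t}y + b^{2}\,y^{t}y$$

$$\frac{\partial e}{\partial b} = 0 \;\Rightarrow\; b = \frac{x^{t}y}{y^{t}y}$$

$$e_{\min} = x^{t}x - \frac{(x^{t}y)^{2}}{y^{t}y}$$

Since x^t x does not depend on the candidate, minimizing e over the candidates is equivalent to maximizing the normalized correlation term (x^t y)²/(y^t y).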

More specifically, in the present illustrative implementation, the pitch (pitch codebook) search is composed of three stages.

In the first stage, an open-loop pitch lag T_(OL) is estimated in the open-loop pitch search module 206 in response to the weighted speech signal s_(w)(n). As indicated in the foregoing description, this open-loop pitch analysis is usually performed once every 10 ms (two subframes) using techniques well known to those of ordinary skill in the art.

In the second stage, a search criterion C is searched in the closed-loop pitch search module 207 for integer pitch lags around the estimated open-loop pitch lag T_(OL) (usually ±5), which significantly simplifies the search procedure. A simple procedure is used for updating the filtered codevector y_(T) (this vector is defined in the following description) without the need to compute the convolution for every pitch lag. An example of search criterion C is given by:

$$C = \frac{x^{t} y_{T}}{\sqrt{y_{T}^{t} y_{T}}}$$

where t denotes vector transpose.

Once an optimum integer pitch lag is found in the second stage, a third stage of the search (module 207) tests, by means of the search criterion C, the fractions around that optimum integer pitch lag. For example, the AMR-WB standard uses ¼ and ½ subsample resolution.
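The second-stage integer search can be sketched as follows. This is a simplified illustration under stated assumptions: the convolution is recomputed for every candidate lag instead of using the fast update of y_(T) mentioned above, no frequency shaping filter or fractional lag is considered, and all function names and the toy data are hypothetical.

```python
import math

def pitch_codevector(past_exc, lag, n_samples):
    """Past excitation at delay `lag`, repeated periodically when lag < n_samples."""
    buf = list(past_exc)
    start = len(buf)
    for n in range(n_samples):
        buf.append(buf[start + n - lag])
    return buf[start:]

def convolve_truncated(v, h):
    """First len(v) samples of the convolution of v with impulse response h."""
    return [sum(v[k] * h[n - k] for k in range(n + 1) if n - k < len(h))
            for n in range(len(v))]

def search_integer_lag(x, past_exc, h, t_ol, delta=5):
    """Return (C, lag, gain) for the integer lag maximizing C near t_ol."""
    best = None
    low = max(1, t_ol - delta)
    high = min(len(past_exc), t_ol + delta)
    for lag in range(low, high + 1):
        y = convolve_truncated(pitch_codevector(past_exc, lag, len(x)), h)
        energy = sum(s * s for s in y)
        if energy <= 0.0:
            continue                      # silent candidate, C undefined
        corr = sum(a * b for a, b in zip(x, y))
        c = corr / math.sqrt(energy)
        if best is None or c > best[0]:
            best = (c, lag, corr / energy)
    return best

# Toy data: past excitation is a pulse train with a 40-sample period
past = [1.0 if n % 40 == 0 else 0.0 for n in range(120)]
h = [1.0, 0.5]
reference = convolve_truncated(pitch_codevector(past, 40, 64), h)
x = [0.8 * s for s in reference]          # target built with lag 40, gain 0.8
c, lag, gain = search_integer_lag(x, past, h, t_ol=42)
print(lag, gain)
```

In this toy case the search recovers the lag of 40 samples and the gain of 0.8 used to build the target, even though the open-loop estimate passed in is 42.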

In wideband signals, the harmonic structure exists only up to a certain frequency, depending on the speech segment. Thus, in order to achieve efficient representation of the pitch contribution in voiced segments of a wideband speech signal, flexibility is needed to vary the amount of periodicity over the wideband spectrum. This is achieved by processing the pitch codevector through a plurality of frequency shaping filters (for example low-pass or band-pass filters). The frequency shaping filter that minimizes the mean-squared weighted error e^((j)) is selected. The selected frequency shaping filter is identified by an index j.

The pitch codebook index T is encoded and transmitted to the multiplexer 213 for transmission through a communication channel. The pitch gain b is quantized and transmitted to the multiplexer 213. An extra bit is used to encode the index j, this extra bit being also supplied to the multiplexer 213.

Once the pitch, or LTP (Long Term Prediction) parameters b, T, and j are determined, the next step is to search for the optimum innovative excitation by means of the innovative excitation search module 210 of FIG. 2. First, the target vector x is updated by subtracting the LTP contribution:

$$x' = x - b\,y_{T}$$

where b is the pitch gain and y_(T) is the filtered pitch codebook vector (the past excitation at delay T filtered with the selected frequency shaping filter (index j) and convolved with the impulse response h).

The innovative excitation search procedure in CELP is performed in an innovation codebook to find the optimum excitation codevector c_(k) and gain g which minimize the mean-squared error E between the target vector x′ and a scaled filtered version of the codevector c_(k), for example:

$$E = \left\| x' - g H c_{k} \right\|^{2}$$

where H is a lower triangular convolution matrix derived from the impulse response vector h. The index k of the innovation codebook corresponding to the found optimum codevector c_(k) and the gain g are supplied to the multiplexer 213 for transmission through a communication channel.
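The criterion E can be illustrated with an exhaustive search over a tiny explicit codebook. This brute-force sketch is an assumption made for clarity only: a practical algebraic codebook is far too large to enumerate this way and is searched with dedicated fast procedures. The function names and the toy codebook are hypothetical.

```python
def filter_codevector(c, h):
    """Compute H c, where H is the lower triangular convolution matrix built from h."""
    return [sum(h[n - k] * c[k] for k in range(n + 1) if n - k < len(h))
            for n in range(len(c))]

def search_codebook(x_prime, codebook, h):
    """Return (E, k, g) minimizing E = ||x' - g H c_k||^2 over the codebook."""
    best = None
    for k, c in enumerate(codebook):
        z = filter_codevector(c, h)
        energy = sum(v * v for v in z)
        if energy == 0.0:
            continue
        # For a fixed codevector, the optimal gain is the least-squares solution
        g = sum(a * b for a, b in zip(x_prime, z)) / energy
        err = sum((a - g * b) ** 2 for a, b in zip(x_prime, z))
        if best is None or err < best[0]:
            best = (err, k, g)
    return best

h = [1.0, 0.5]
codebook = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, -1.0, 0.0],
]
# Target generated from codevector 3 with gain 2
x_prime = [2.0 * v for v in filter_codevector(codebook[3], h)]
err, k, g = search_codebook(x_prime, codebook, h)
print(k, g, err)
```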

It should be noted that the innovation codebook used is a dynamic codebook consisting of an algebraic codebook followed by an adaptive prefilter F(z) which enhances special spectral components in order to improve the synthesis speech quality, according to U.S. Pat. No. 5,444,816 granted to Adoul et al. on Aug. 22, 1995. In this illustrative implementation, the innovative codebook search is performed in module 210 by means of an algebraic codebook as described in U.S. Pat. No. 5,444,816 (Adoul et al.) issued on Aug. 22, 1995; U.S. Pat. No. 5,699,482 granted to Adoul et al., on Dec. 17, 1997; U.S. Pat. No. 5,754,976 granted to Adoul et al., on May 19, 1998; and U.S. Pat. No. 5,701,392 (Adoul et al.) dated Dec. 23, 1997.

Overview of AMR-WB Decoder

The speech decoder 300 of FIG. 3 illustrates the various steps carried out between the digital input 322 (input bit stream to the demultiplexer 317) and the output sampled speech signal 323 (output of the adder 321).

Demultiplexer 317 extracts the synthesis model parameters from the binary information (input bit stream 322) received from a digital input channel. From each received binary frame, the extracted parameters are:

-   the quantized, interpolated LP coefficients Â(z) also called short-term prediction parameters (STP) produced once per frame;
-   the long-term prediction (LTP) parameters T, b, and j (for each subframe); and
-   the innovation codebook index k and gain g (for each subframe).
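The parameters listed above can be pictured as one container per frame holding one entry per subframe. The sketch below is only an illustrative data layout; the class and field names are hypothetical and do not reflect the bit stream format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SubframeParameters:
    pitch_lag: float        # T, may be fractional
    pitch_gain: float       # b
    filter_index: int       # j, selects the frequency shaping filter
    codebook_index: int     # k, innovation codebook index
    codebook_gain: float    # g

@dataclass
class FrameParameters:
    lp_coefficients: List[float]           # quantized LP filter, once per frame
    subframes: List[SubframeParameters]    # one entry per subframe

    def __post_init__(self):
        # The illustrative implementation uses 4 subframes of 5 ms per frame
        if len(self.subframes) != 4:
            raise ValueError("expected 4 subframes per 20 ms frame")

sub = SubframeParameters(pitch_lag=57.25, pitch_gain=0.8, filter_index=0,
                         codebook_index=12345, codebook_gain=1.5)
frame = FrameParameters(lp_coefficients=[1.0] + [0.0] * 16, subframes=[sub] * 4)
print(len(frame.subframes))
```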

The current speech signal is synthesized based on these parameters as will be explained hereinbelow.

The innovation codebook 318 is responsive to the index k to produce the innovation codevector c_(k), which is scaled by the decoded gain factor g through an amplifier 324. In the illustrative implementation, an innovation codebook as described in the above mentioned U.S. Pat. Nos. 5,444,816; 5,699,482; 5,754,976; and 5,701,392 is used to produce the innovative codevector c_(k).

The generated scaled codevector at the output of the amplifier 324 is processed through a frequency-dependent pitch enhancer 305.

Enhancing the periodicity of the excitation signal u improves the quality of voiced segments. The periodicity enhancement is achieved by filtering the innovative codevector c_(k) from the innovation (fixed) codebook through an innovation filter F(z) (pitch enhancer 305) whose frequency response emphasizes the higher frequencies more than the lower frequencies. The coefficients of the innovation filter F(z) are related to the amount of periodicity in the excitation signal u.

An efficient, illustrative way to derive the coefficients of the innovation filter F(z) is to relate them to the amount of pitch contribution in the total excitation signal u. This results in a frequency response depending on the subframe periodicity, where higher frequencies are more strongly emphasized (stronger overall slope) for higher pitch gains. The innovation filter 305 has the effect of lowering the energy of the innovation codevector c_(k) at lower frequencies when the excitation signal u is more periodic, which enhances the periodicity of the excitation signal u at lower frequencies more than higher frequencies. A suggested form for the innovation filter 305 is the following:

$$F(z) = -\alpha z + 1 - \alpha z^{-1}$$

where α is a periodicity factor derived from the level of periodicity of the excitation signal u. The periodicity factor α is computed in the voicing factor generator 304. First, a voicing factor r_(v) is computed in voicing factor generator 304 by:

$$r_{v} = \frac{E_{v} - E_{c}}{E_{v} + E_{c}}$$

where E_(v) is the energy of the scaled pitch codevector bv_(T) and E_(c) is the energy of the scaled innovative codevector gc_(k). That is:

$$E_{v} = b^{2} v_{T}^{t} v_{T} = b^{2} \sum_{n=0}^{N-1} v_{T}^{2}(n)$$

and

$$E_{c} = g^{2} c_{k}^{t} c_{k} = g^{2} \sum_{n=0}^{N-1} c_{k}^{2}(n)$$

Note that the value of r_(v) lies between −1 and 1 (1 corresponds to purely voiced signals and −1 corresponds to purely unvoiced signals).

The above mentioned scaled pitch codevector bv_(T) is produced by applying the pitch delay T to a pitch codebook 301 to produce a pitch codevector. The pitch codevector is then processed through a low-pass filter 302 whose cut-off frequency is selected in relation to index j from the demultiplexer 317 to produce the filtered pitch codevector v_(T). Then, the filtered pitch codevector v_(T) is amplified by the pitch gain b by an amplifier 326 to produce the scaled pitch codevector bv_(T).

In this illustrative implementation, the factor α is then computed in voicing factor generator 304 by:

$$\alpha = 0.125\,(1 + r_{v})$$

which corresponds to a value of 0 for purely unvoiced signals and 0.25 for purely voiced signals.
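The voicing factor, the periodicity factor and the innovation filter F(z) = −αz + 1 − αz⁻¹ can be sketched together as follows. The function names are hypothetical, samples outside the subframe are assumed to be zero when applying F(z), and the case where both energies are zero is not handled.

```python
def energy(vector):
    """Sum of squared samples."""
    return sum(s * s for s in vector)

def voicing_factor(b, v_t, g, c_k):
    """r_v = (E_v - E_c) / (E_v + E_c), with E_v and E_c as defined above."""
    e_v = b * b * energy(v_t)
    e_c = g * g * energy(c_k)
    return (e_v - e_c) / (e_v + e_c)

def periodicity_factor(r_v):
    """alpha = 0.125 * (1 + r_v), ranging from 0 to 0.25."""
    return 0.125 * (1.0 + r_v)

def innovation_filter(code, alpha):
    """Apply F(z): c_f(n) = c(n) - alpha * (c(n+1) + c(n-1))."""
    out = []
    for n in range(len(code)):
        prev = code[n - 1] if n > 0 else 0.0
        nxt = code[n + 1] if n + 1 < len(code) else 0.0
        out.append(code[n] - alpha * (prev + nxt))
    return out

# Purely voiced case: no innovation energy
alpha_voiced = periodicity_factor(voicing_factor(1.0, [1.0, 1.0], 0.0, [1.0, 0.0]))
# Purely unvoiced case: no pitch energy
alpha_unvoiced = periodicity_factor(voicing_factor(0.0, [1.0, 1.0], 1.0, [1.0, 0.0]))
filtered = innovation_filter([0.0, 1.0, 0.0], alpha_voiced)
print(alpha_voiced, alpha_unvoiced, filtered)
```

With α=0.25 a single pulse is surrounded by two negative side lobes, which removes low-frequency energy from the innovation; with α=0 the codevector passes through unchanged.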

The enhanced signal c_(f) is therefore computed by filtering the scaled innovative codevector gc_(k) through the innovation filter 305 (F(z)).

The enhanced excitation signal u′ is computed by the adder 320 as:

$$u' = c_{f} + b\,v_{T}$$

It should be noted that this process is not performed at the encoder 200. Thus, it is essential to update the content of the pitch codebook 301 using the past value of the excitation signal u without enhancement stored in memory 303 to keep synchronism between the encoder 200 and decoder 300. Therefore, the excitation signal u is used to update the memory 303 of the pitch codebook 301 and the enhanced excitation signal u′ is used at the input of the LP synthesis filter 306.
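The distinction between the two excitation signals can be sketched as follows: the non-enhanced excitation u (pitch plus innovation contributions) feeds the pitch codebook memory, while the enhanced excitation u′ goes to the synthesis filter. The function names, the fixed-length list used as memory and the zero-padding at the subframe edges are assumptions for illustration.

```python
def innovation_filter(code, alpha):
    """Apply F(z): c_f(n) = c(n) - alpha * (c(n+1) + c(n-1)), zero outside the subframe."""
    out = []
    for n in range(len(code)):
        prev = code[n - 1] if n > 0 else 0.0
        nxt = code[n + 1] if n + 1 < len(code) else 0.0
        out.append(code[n] - alpha * (prev + nxt))
    return out

def decode_excitation(v_t, b, c_k, g, alpha):
    """Return (u, u_enh): excitation without and with periodicity enhancement."""
    scaled_pitch = [b * s for s in v_t]
    scaled_code = [g * s for s in c_k]
    u = [p + c for p, c in zip(scaled_pitch, scaled_code)]
    c_f = innovation_filter(scaled_code, alpha)
    u_enh = [p + c for p, c in zip(scaled_pitch, c_f)]
    return u, u_enh

def update_pitch_memory(memory, u):
    """Shift the new non-enhanced excitation into a fixed-length memory."""
    return (list(memory) + list(u))[-len(memory):]

u, u_enh = decode_excitation([1.0, 0.0, 0.0], 0.5, [0.0, 1.0, 0.0], 2.0, 0.25)
memory = update_pitch_memory([9.0, 9.0, 9.0, 9.0], u)   # uses u, not u_enh
print(u, u_enh, memory)
```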

The synthesized signal s′ is computed by filtering the enhanced excitation signal u′ through the LP synthesis filter 306 which has the form 1/Â(z), where Â(z) is the quantized, interpolated LP filter in the current subframe. As can be seen in FIG. 3, the quantized, interpolated LP coefficients Â(z) on line 325 from the demultiplexer 317 are supplied to the LP synthesis filter 306 to adjust the parameters of the LP synthesis filter 306 accordingly. The deemphasis filter 307 is the inverse of the preemphasis filter 203 of FIG. 2. The transfer function of the deemphasis filter 307 is given by

$$D(z) = \frac{1}{1 - \mu z^{-1}}$$

where μ is a preemphasis factor with a value located between 0 and 1 (a typical value is μ=0.7). A higher-order filter could also be used.
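Both filters in this paragraph are purely recursive. With Â(z) = 1 + Σ â_(i) z⁻ⁱ, the synthesis filter computes s′(n) = u′(n) − Σ â_(i) s′(n−i), and the deemphasis filter computes s_(d)(n) = s′(n) + μ s_(d)(n−1). The sketch below illustrates both; the function names and the zero initial memories are assumptions.

```python
def lp_synthesis(excitation, a, memory=None):
    """Filter through 1/A(z), with a = [1, a_1, ..., a_p].

    `memory` holds past outputs, most recent first; zeros are assumed if omitted.
    """
    order = len(a) - 1
    hist = list(memory) if memory is not None else [0.0] * order
    out = []
    for u in excitation:
        s = u - sum(a[i] * hist[i - 1] for i in range(1, order + 1))
        out.append(s)
        hist = ([s] + hist)[:order]
    return out

def deemphasize(signal, mu=0.7, prev_output=0.0):
    """Apply D(z) = 1 / (1 - mu*z^-1), i.e. y(n) = x(n) + mu*y(n-1)."""
    out = []
    prev = prev_output
    for x in signal:
        prev = x + mu * prev
        out.append(prev)
    return out

# Impulse responses for a toy first-order filter A(z) = 1 - 0.5 z^-1
synth = lp_synthesis([1.0, 0.0, 0.0], [1.0, -0.5])
deemph = deemphasize([1.0, 0.0, 0.0])
print(synth, deemph)
```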

The vector s′ is filtered through the deemphasis filter D(z) 307 to obtain the vector s_(d), which is processed through the high-pass filter 308 to remove the unwanted frequencies below 50 Hz and further obtain s_(h).

The oversampler 309 conducts the inverse process of the downsampler 201 of FIG. 2. In this illustrative embodiment, over-sampling converts the 12.8 kHz sampling rate back to the original 16 kHz sampling rate, using techniques well known to those of ordinary skill in the art. The oversampled synthesis signal is denoted ŝ. Signal ŝ is also referred to as the synthesized wideband intermediate signal.

The oversampled synthesis signal ŝ does not contain the higher frequency components which were lost during the downsampling process (module 201 of FIG. 2) at the encoder 200. This gives a low-pass perception to the synthesized speech signal. To restore the full band of the original signal, a high frequency generation procedure is performed in module 310 and requires input from voicing factor generator 304 (FIG. 3).

The resulting band-pass filtered noise sequence z from the high frequency generation module 310 is added by the adder 321 to the oversampled synthesized speech signal ŝ to obtain the final reconstructed output speech signal s_(out) on the output 323. An example of high frequency regeneration process is described in International PCT patent application published under No. WO 00/25305 on May 4, 2000.

The bit allocation of the AMR-WB codec at 12.65 kbit/s is given in Table 1.

TABLE 1. Bit allocation in the 12.65-kbit/s mode

Parameter            Bits / Frame
LP Parameters        46
Pitch Delay          30 = 9 + 6 + 9 + 6
Pitch Filtering      4 = 1 + 1 + 1 + 1
Gains                28 = 7 + 7 + 7 + 7
Algebraic Codebook   144 = 36 + 36 + 36 + 36
Mode Bit             1
Total                253 bits = 12.65 kbit/s

Robust Frame Erasure Concealment

The erasure of frames has a major effect on the synthesized speech quality in digital speech communication systems, especially when operating in wireless environments and packet-switched networks. In wireless cellular systems, the energy of the received signal can exhibit frequent severe fades resulting in high bit error rates and this becomes more evident at the cell boundaries. In this case the channel decoder fails to correct the errors in the received frame and as a consequence, the error detector usually used after the channel decoder will declare the frame as erased. In voice over packet network applications, such as Voice over Internet Protocol (VoIP), the speech signal is packetized where usually a 20 ms frame is placed in each packet. In packet-switched communications, a packet dropping can occur at a router if the number of packets becomes very large, or the packet can arrive at the receiver after a long delay and it should be declared as lost if its delay is more than the length of a jitter buffer at the receiver side. In these systems, the codec is subjected to typically 3 to 5% frame erasure rates.

The problem of frame erasure (FER) processing is basically twofold. First, when an erased frame indicator arrives, the missing frame must be generated by using the information sent in the previous frame and by estimating the signal evolution in the missing frame. The success of the estimation depends not only on the concealment strategy, but also on the place in the speech signal where the erasure happens. Secondly, a smooth transition must be assured when normal operation recovers, i.e. when the first good frame arrives after a block of erased frames (one or more). This is not a trivial task as the true synthesis and the estimated synthesis can evolve differently. When the first good frame arrives, the decoder is hence desynchronized from the encoder. The main reason is that low bit rate encoders rely on pitch prediction, and during erased frames, the memory of the pitch predictor is no longer the same as the one at the encoder. The problem is amplified when many consecutive frames are erased. As for the concealment, the difficulty of the normal processing recovery depends on the type of speech signal where the erasure occurred.

The negative effect of frame erasures can be significantly reduced by adapting the concealment and the recovery of normal processing (further recovery) to the type of the speech signal where the erasure occurs. For this purpose, it is necessary to classify each speech frame. This classification can be done at the encoder and transmitted. Alternatively, it can be estimated at the decoder.

For the best concealment and recovery, there are a few critical characteristics of the speech signal that must be carefully controlled. These critical characteristics are the signal energy or the amplitude, the amount of periodicity, the spectral envelope and the pitch period. In case of a voiced speech recovery, further improvement can be achieved by a phase control. With a slight increase in the bit rate, a few supplementary parameters can be quantized and transmitted for better control. If no additional bandwidth is available, the parameters can be estimated at the decoder. With these parameters controlled, the frame erasure concealment and recovery can be significantly improved, especially by improving the convergence of the decoded signal to the actual signal at the encoder and alleviating the effect of mismatch between the encoder and decoder when normal processing recovers.

In the present illustrative embodiment of the present invention, methods for efficient frame erasure concealment, and methods for extracting and transmitting parameters that will improve the performance and convergence at the decoder in the frames following an erased frame are disclosed. These parameters include two or more of the following: frame classification, energy, voicing information, and phase information. Further, methods for extracting such parameters at the decoder if transmission of extra bits is not possible, are disclosed. Finally, methods for improving the decoder convergence in good frames following an erased frame are also disclosed.

The frame erasure concealment techniques according to the present illustrative embodiment have been applied to the AMR-WB codec described above. This codec will serve as an example framework for the implementation of the FER concealment methods in the following description. As explained above, the input speech signal 212 to the codec has a 16 kHz sampling frequency, but it is downsampled to a 12.8 kHz sampling frequency before further processing. In the present illustrative embodiment, FER processing is done on the downsampled signal.

FIG. 4 gives a simplified block diagram of the AMR-WB encoder 400. In this simplified block diagram, the downsampler 201, high-pass filter 202 and preemphasis filter 203 are grouped together in the preprocessing module 401. Also, the closed-loop search module 207, the zero-input response calculator 208, the impulse response calculator 209, the innovative excitation search module 210, and the memory update module 211 are grouped in a closed-loop pitch and innovation codebook search modules 402. This grouping is done to simplify the introduction of the new modules related to the illustrative embodiment of the present invention.

FIG. 5 is an extension of the block diagram of FIG. 4 where the modules related to the illustrative embodiment of the present invention are added. In these added modules 500 to 507, additional parameters are computed, quantized, and transmitted with the aim to improve the FER concealment and the convergence and recovery of the decoder after erased frames. In the present illustrative embodiment, these parameters include signal classification, energy, and phase information (the estimated position of the first glottal pulse in a frame).

In the next sections, computation and quantization of these additional parameters will be given in detail and become more apparent with reference to FIG. 5. Among these parameters, signal classification will be treated in more detail. In the subsequent sections, efficient FER concealment using these additional parameters to improve the convergence will be explained.

Signal Classification for FER Concealment and Recovery

The basic idea behind using a classification of the speech for a signal reconstruction in the presence of erased frames consists of the fact that the ideal concealment strategy is different for quasi-stationary speech segments and for speech segments with rapidly changing characteristics. While the best processing of erased frames in non-stationary speech segments can be summarized as a rapid convergence of speech-encoding parameters to the ambient noise characteristics, in the case of quasi-stationary signal, the speech-encoding parameters do not vary dramatically and can be kept practically unchanged during several adjacent erased frames before being damped. Also, the optimal method for a signal recovery following an erased block of frames varies with the classification of the speech signal.

The speech signal can be roughly classified as voiced, unvoiced and pauses. Voiced speech contains an important amount of periodic components and can be further divided in the following categories: voiced onsets, voiced segments, voiced transitions and voiced offsets. A voiced onset is defined as a beginning of a voiced speech segment after a pause or an unvoiced segment. During voiced segments, the speech signal parameters (spectral envelope, pitch period, ratio of periodic and non-periodic components, energy) vary slowly from frame to frame. A voiced transition is characterized by rapid variations of a voiced speech, such as a transition between vowels. Voiced offsets are characterized by a gradual decrease of energy and voicing at the end of voiced segments.

The unvoiced parts of the signal are characterized by the absence of a periodic component and can be further divided into unstable frames, where the energy and the spectrum change rapidly, and stable frames where these characteristics remain relatively stable. Remaining frames are classified as silence. Silence frames comprise all frames without active speech, i.e. also noise-only frames if a background noise is present.

Not all of the above mentioned classes need a separate processing. Hence, for the purposes of error concealment techniques, some of the signal classes are grouped together.

Classification at the Encoder

When there is available bandwidth in the bitstream to include the classification information, the classification can be done at the encoder. This has several advantages. The most important is that there is often a look-ahead in speech encoders. The look-ahead makes it possible to estimate the evolution of the signal in the following frame and consequently the classification can be done by taking into account the future signal behavior. Generally, the longer the look-ahead, the better the classification can be. A further advantage is a complexity reduction, as most of the signal processing necessary for frame erasure concealment is needed anyway for speech encoding. Finally, there is also the advantage of working with the original signal instead of the synthesized signal.

The frame classification is done with the consideration of the concealment and recovery strategy in mind. In other words, any frame is classified in such a way that the concealment can be optimal if the following frame is missing, or that the recovery can be optimal if the previous frame was lost. Some of the classes used for the FER processing need not be transmitted, as they can be deduced without ambiguity at the decoder. In the present illustrative embodiment, five (5) distinct classes are used, and defined as follows:

UNVOICED class comprises all unvoiced speech frames and all frames without active speech. A voiced offset frame can be also classified as UNVOICED if its end tends to be unvoiced and the concealment designed for unvoiced frames can be used for the following frame in case it is lost.

UNVOICED TRANSITION class comprises unvoiced frames with a possible voiced onset at the end. The onset is however still too short or not built well enough to use the concealment designed for voiced frames. The UNVOICED TRANSITION class can follow only a frame classified as UNVOICED or UNVOICED TRANSITION.

VOICED TRANSITION class comprises voiced frames with relatively weak voiced characteristics. Those are typically voiced frames with rapidly changing characteristics (transitions between vowels) or voiced offsets lasting the whole frame. The VOICED TRANSITION class can follow only a frame classified as VOICED TRANSITION, VOICED or ONSET.

VOICED class comprises voiced frames with stable characteristics. This class can follow only a frame classified as VOICED TRANSITION, VOICED or ONSET.

ONSET class comprises all voiced frames with stable characteristics following a frame classified as UNVOICED or UNVOICED TRANSITION. Frames classified as ONSET correspond to voiced onset frames where the onset is already sufficiently well built for the use of the concealment designed for lost voiced frames. The concealment techniques used for a frame erasure following the ONSET class are the same as following the VOICED class. The difference is in the recovery strategy. If an ONSET class frame is lost (i.e. a VOICED good frame arrives after an erasure, but the last good frame before the erasure was UNVOICED), a special technique can be used to artificially reconstruct the lost onset. This scenario can be seen in FIG. 6. The artificial onset reconstruction techniques will be described in more detail in the following description. On the other hand if an ONSET good frame arrives after an erasure and the last good frame before the erasure was UNVOICED, this special processing is not needed, as the onset has not been lost (has not been in the lost frame).

The classification state diagram is outlined in FIG. 7. If the available bandwidth is sufficient, the classification is done in the encoder and transmitted using 2 bits. As can be seen from FIG. 7, the UNVOICED TRANSITION class and the VOICED TRANSITION class can be grouped together as they can be unambiguously differentiated at the decoder (UNVOICED TRANSITION can follow only UNVOICED or UNVOICED TRANSITION frames, VOICED TRANSITION can follow only ONSET, VOICED or VOICED TRANSITION frames). The following parameters are used for the classification: a normalized correlation r_(x), a spectral tilt measure e_(t), a signal to noise ratio snr, a pitch stability counter pc, a relative frame energy of the signal at the end of the current frame E_(s) and a zero-crossing counter zc. As can be seen in the following detailed analysis, the computation of these parameters uses the available look-ahead as much as possible to take into account the behavior of the speech signal also in the following frame.
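As an illustration only, the grouping of the two transition classes under one transmitted code can be sketched as follows. The numeric 2-bit code values are hypothetical, since the description states only that 2 bits are used, and the sketch assumes that the class of the previous frame is known at the decoder.

```python
UNVOICED = "UNVOICED"
UNVOICED_TRANSITION = "UNVOICED TRANSITION"
VOICED_TRANSITION = "VOICED TRANSITION"
VOICED = "VOICED"
ONSET = "ONSET"

# Hypothetical 2-bit codes; both transition classes share code 1.
CODE = {
    UNVOICED: 0,
    UNVOICED_TRANSITION: 1,
    VOICED_TRANSITION: 1,
    VOICED: 2,
    ONSET: 3,
}


def encode_class(frame_class):
    """Map one of the five classes to a 2-bit code."""
    return CODE[frame_class]


def decode_class(code, previous_class):
    """Recover the class; the shared code is resolved from the previous class."""
    if code == 0:
        return UNVOICED
    if code == 2:
        return VOICED
    if code == 3:
        return ONSET
    if code != 1:
        raise ValueError("code must be in the range 0..3")
    # UNVOICED TRANSITION can only follow UNVOICED or UNVOICED TRANSITION.
    if previous_class in (UNVOICED, UNVOICED_TRANSITION):
        return UNVOICED_TRANSITION
    return VOICED_TRANSITION
```

For example, `decode_class(encode_class(VOICED_TRANSITION), ONSET)` yields the VOICED TRANSITION class, whereas the same code received after an UNVOICED frame is read as UNVOICED TRANSITION.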

The normalized correlation $r_x$ is computed as part of the open-loop pitch search module 206 of FIG. 5. This module 206 usually outputs the open-loop pitch estimate every 10 ms (twice per frame). Here, it is also used to output the normalized correlation measures. These normalized correlations are computed on the current weighted speech signal $s_w(n)$ and the past weighted speech signal at the open-loop pitch delay. In order to reduce the complexity, the weighted speech signal $s_w(n)$ is downsampled by a factor of 2 prior to the open-loop pitch analysis down to the sampling frequency of 6400 Hz [3GPP TS 26.190, “AMR Wideband Speech Codec: Transcoding Functions,” 3GPP Technical Specification]. The average correlation $\bar{r}_x$ is defined as

$\bar{r}_x = 0.5\,\bigl(r_x(1) + r_x(2)\bigr) \qquad (1)$

where $r_x(1)$, $r_x(2)$ are respectively the normalized correlation of the second half of the current frame and of the look-ahead. In this illustrative embodiment, a look-ahead of 13 ms is used unlike the AMR-WB standard that uses 5 ms. The normalized correlation $r_x(k)$ is computed as follows:

$r_x(k) = \frac{r_{xy}}{\sqrt{r_{xx} \cdot r_{yy}}} \qquad (2)$

where

$r_{xy} = \sum_{i=0}^{L_k - 1} x(t_k + i) \cdot x(t_k + i - p_k)$

$r_{xx} = \sum_{i=0}^{L_k - 1} x^2(t_k + i)$

$r_{yy} = \sum_{i=0}^{L_k - 1} x^2(t_k + i - p_k)$

The correlations r_(x)(k) are computed using the weighted speech signal s_(w)(n). The instants t_(k) are related to the current frame beginning and are equal to 64 and 128 samples respectively at the sampling rate or frequency of 6.4 kHz (10 and 20 ms). The values p_(k)=T_(OL) are the selected open-loop pitch estimates. The length of the autocorrelation computation L_(k) is dependent on the pitch period. The values of L_(k) are summarized below (for the sampling rate of 6.4 kHz):

- L_(k) = 40 samples for p_(k) ≤ 31 samples
- L_(k) = 62 samples for p_(k) ≤ 61 samples
- L_(k) = 115 samples for p_(k) > 61 samples

These lengths assure that the correlated vector length comprises at least one pitch period which helps for a robust open-loop pitch detection. For long pitch periods (p₁>61 samples), r_(x)(1) and r_(x)(2) are identical, i.e. only one correlation is computed since the correlated vectors are long enough so that the analysis on the look-ahead is no longer necessary.
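As an illustration only, the normalized correlation of Equation (2), together with the pitch-dependent length L_(k), can be sketched as follows. In this sketch `x` stands for a buffer holding the decimated weighted speech, and `start` is the absolute buffer index corresponding to the instant t_(k); both names, and the choice to return 0 for an all-zero segment, are assumptions made for the example.

```python
import math


def correlation_length(pitch):
    """Length L_k in samples at 6.4 kHz for an open-loop pitch p_k."""
    if pitch <= 31:
        return 40
    if pitch <= 61:
        return 62
    return 115


def normalized_correlation(x, start, pitch):
    """Eq. (2): correlate x[start:start+L_k] with the same span delayed by pitch."""
    length = correlation_length(pitch)
    if start - pitch < 0 or start + length > len(x):
        raise ValueError("buffer too short for this start and pitch")
    r_xy = r_xx = r_yy = 0.0
    for i in range(length):
        cur = x[start + i]
        past = x[start + i - pitch]
        r_xy += cur * past
        r_xx += cur * cur
        r_yy += past * past
    denom = math.sqrt(r_xx * r_yy)
    return r_xy / denom if denom > 0.0 else 0.0  # assumed handling of silence


def average_correlation(r1, r2):
    """Eq. (1): mean of the two half-frame correlations."""
    return 0.5 * (r1 + r2)
```

For a signal that is exactly periodic with period `pitch`, the function returns a value close to 1; for a delay of half a period of a sinusoid it returns a value close to −1.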

The spectral tilt parameter e_(t) contains the information about the frequency distribution of energy. In the present illustrative embodiment, the spectral tilt is estimated as a ratio between the energy concentrated in low frequencies and the energy concentrated in high frequencies. However, it can also be estimated in different ways such as a ratio between the two first autocorrelation coefficients of the speech signal.

The discrete Fourier Transform is used to perform the spectral analysis in the spectral analysis and spectrum energy estimation module 500 of FIG. 5. The frequency analysis and the tilt computation are done twice per frame. A 256-point Fast Fourier Transform (FFT) is used with a 50 percent overlap. The analysis windows are placed so that all the look-ahead is exploited. In this illustrative embodiment, the beginning of the first window is placed 24 samples after the beginning of the current frame. The second window is placed 128 samples further. Different windows can be used to weight the input signal for the frequency analysis. A square root of a Hanning window (which is equivalent to a sine window) has been used in the present illustrative embodiment. This window is particularly well suited for overlap-add methods. Therefore, this particular spectral analysis can be used in an optional noise suppression algorithm based on spectral subtraction and overlap-add analysis/synthesis.

The energy in high frequencies and in low frequencies is computed in module 500 of FIG. 5 following the perceptual critical bands. In the present illustrative embodiment each critical band is considered up to the following upper frequency limit [J. D. Johnston, “Transform Coding of Audio Signals Using Perceptual Noise Criteria,” IEEE Jour. on Selected Areas in Communications, vol. 6, no. 2, pp. 314-323]:

Critical bands {100.0, 200.0, 300.0, 400.0, 510.0, 630.0, 770.0, 920.0, 1080.0, 1270.0, 1480.0, 1720.0, 2000.0, 2320.0, 2700.0, 3150.0, 3700.0, 4400.0, 5300.0, 6350.0} Hz.

The energy in higher frequencies is computed in module 500 as the average of the energies of the last two critical bands:

$\bar{E}_h = 0.5\,\bigl(e(18) + e(19)\bigr) \qquad (3)$

where the critical band energies $e(i)$ are computed as a sum of the bin energies within the critical band, averaged by the number of the bins.

The energy in lower frequencies is computed as the average of the energies in the first 10 critical bands. The middle critical bands have been excluded from the computation to improve the discrimination between frames with high energy concentration in low frequencies (generally voiced) and with high energy concentration in high frequencies (generally unvoiced). In between, the energy content is not characteristic for any of the classes and would increase the decision confusion.

In module 500, the energy in low frequencies is computed differently for long pitch periods and short pitch periods. For voiced female speech segments, the harmonic structure of the spectrum can be exploited to increase the voiced-unvoiced discrimination. Thus for short pitch periods, $\bar{E}_l$ is computed bin-wise and only frequency bins sufficiently close to the speech harmonics are taken into account in the summation, i.e.

$\bar{E}_l = \frac{1}{cnt} \cdot \sum_{i=0}^{24} e_b(i) \qquad (4)$

where $e_b(i)$ are the bin energies in the first 25 frequency bins (the DC component is not considered). Note that these 25 bins correspond to the first 10 critical bands. In the above summation, only terms related to the bins closer to the nearest harmonics than a certain frequency threshold are non-zero. The counter $cnt$ is equal to the number of those non-zero terms. The threshold for a bin to be included in the sum has been fixed to 50 Hz, i.e. only bins closer than 50 Hz to the nearest harmonics are taken into account. Hence, if the structure is harmonic in low frequencies, only high-energy terms will be included in the sum. On the other hand, if the structure is not harmonic, the selection of the terms will be random and the sum will be smaller. Thus even unvoiced sounds with high energy content in low frequencies can be detected. This processing cannot be done for longer pitch periods, as the frequency resolution is not sufficient. The threshold pitch value is 128 samples corresponding to 100 Hz. It means that for pitch periods longer than 128 samples and also for a priori unvoiced sounds (i.e. when $\bar{r}_x + r_e < 0.6$), the low frequency energy estimation is done per critical band and is computed as

$\bar{E}_l = \frac{1}{10} \cdot \sum_{i=0}^{9} e(i) \qquad (5)$
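As an illustration only, the choice between the bin-wise estimate of Equation (4) and the per-band estimate of Equation (5) can be sketched as follows. The sketch assumes that bin i of the 25 low-frequency bins lies at (i+1)·50 Hz, which follows from a 256-point transform at 12.8 kHz with the DC bin skipped, and that the pitch period is expressed in samples at 12.8 kHz. The function and argument names are hypothetical.

```python
SAMPLING_HZ = 12800.0
BIN_HZ = SAMPLING_HZ / 256      # 50 Hz spacing of a 256-point transform
HARMONIC_THRESHOLD_HZ = 50.0    # bins closer than this to a harmonic are kept
PITCH_LIMIT = 128               # samples, i.e. 100 Hz at 12.8 kHz


def low_frequency_energy(bin_energies, band_energies, pitch, corr_sum):
    """Return the low-frequency energy estimate.

    bin_energies:  e_b(0..24), DC excluded (assumed bin i at (i+1)*50 Hz)
    band_energies: e(0..19), critical band energies
    pitch:         pitch period in samples at 12.8 kHz
    corr_sum:      average normalized correlation plus its noise correction
    """
    if pitch > PITCH_LIMIT or corr_sum < 0.6:
        # Eq. (5): average over the first 10 critical bands.
        return sum(band_energies[:10]) / 10.0
    f0 = SAMPLING_HZ / pitch
    total, cnt = 0.0, 0
    for i in range(25):
        f_bin = (i + 1) * BIN_HZ
        nearest = max(1, round(f_bin / f0)) * f0  # nearest harmonic of f0
        if abs(f_bin - nearest) < HARMONIC_THRESHOLD_HZ:
            total += bin_energies[i]
            cnt += 1
    # Eq. (4); returning 0 when no bin qualifies is an assumption.
    return total / cnt if cnt else 0.0
```

With a pitch period of 64 samples (200 Hz), only the bins at 200, 400, ..., 1200 Hz contribute, so energy located between the harmonics does not raise the estimate.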

The value $r_e$, calculated in a noise estimation and normalized correlation correction module 501, is a correction added to the normalized correlation in the presence of background noise for the following reason. In the presence of background noise, the average normalized correlation decreases. However, for the purpose of signal classification, this decrease should not affect the voiced-unvoiced decision. It has been found that the dependence between this decrease $r_e$ and the total background noise energy in dB is approximately exponential and can be expressed using the following relationship

$r_e = 2.4492 \cdot 10^{-4} \cdot e^{0.1596 \cdot N_{dB}} - 0.022$

where $N_{dB}$ stands for

$N_{dB} = 10 \cdot \log_{10}\left(\frac{1}{20} \sum_{i=0}^{19} n(i)\right) - g_{dB}$

Here, $n(i)$ are the noise energy estimates for each critical band normalized in the same way as $e(i)$ and $g_{dB}$ is the maximum noise suppression level in dB allowed for the noise reduction routine. The value $r_e$ is not allowed to be negative. It should be noted that when a good noise reduction algorithm is used and $g_{dB}$ is sufficiently high, $r_e$ is practically equal to zero. It is only relevant when the noise reduction is disabled or if the background noise level is significantly higher than the maximum allowed reduction. The influence of $r_e$ can be tuned by multiplying this term with a constant.
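As an illustration only, the correction term and its clamping at zero can be sketched as follows. The sketch assumes that e in the relationship above is the base of the natural logarithm; the function names are hypothetical.

```python
import math


def noise_level_db(noise_band_energies, g_db):
    """N_dB: mean noise energy over the 20 critical bands in dB, minus g_dB."""
    mean_noise = sum(noise_band_energies) / 20.0
    return 10.0 * math.log10(mean_noise) - g_db


def correlation_correction(n_db):
    """r_e as a function of N_dB, not allowed to be negative."""
    r_e = 2.4492e-4 * math.exp(0.1596 * n_db) - 0.022
    return max(r_e, 0.0)
```

At low noise levels the exponential term is smaller than 0.022 and the correction is clamped to zero, which matches the remark that the term only matters when the residual noise is significant.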

Finally, the resulting lower and higher frequency energies are obtained by subtracting an estimated noise energy from the values $\bar{E}_h$ and $\bar{E}_l$ calculated above. That is

$E_h = \bar{E}_h - f_c \cdot N_h \qquad (6)$

$E_l = \bar{E}_l - f_c \cdot N_l \qquad (7)$

where $N_h$ and $N_l$ are the averaged noise energies in the last two (2) critical bands and first ten (10) critical bands, respectively, computed using equations similar to Equations (3) and (5), and $f_c$ is a correction factor tuned so that these measures remain close to constant with varying the background noise level. In this illustrative embodiment, the value of $f_c$ has been fixed to 3.

The spectral tilt $e_t$ is calculated in the spectral tilt estimation module 503 using the relation:

$e_t = \frac{E_l}{E_h} \qquad (8)$

and it is averaged in the dB domain for the two (2) frequency analyses performed per frame:

$\bar{e}_t = 10 \cdot \log_{10}\bigl(e_t(0) \cdot e_t(1)\bigr)$
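As an illustration only, Equations (6) to (8) and the per-frame combination of the two analyses can be sketched as follows. The small positive floor applied after the noise subtraction is an assumption added to keep the ratio and the logarithm defined; it is not part of the description. The function names are hypothetical.

```python
import math

F_C = 3.0  # correction factor used in this illustrative embodiment


def spectral_tilt(e_l_bar, e_h_bar, n_l, n_h, floor=1e-9):
    """Eqs. (6)-(8): noise-corrected ratio of low to high frequency energy."""
    e_h = max(e_h_bar - F_C * n_h, floor)  # Eq. (6), floored (assumption)
    e_l = max(e_l_bar - F_C * n_l, floor)  # Eq. (7), floored (assumption)
    return e_l / e_h                       # Eq. (8)


def frame_tilt_db(tilt_first, tilt_second):
    """Combine the two per-frame tilt values in the dB domain."""
    return 10.0 * math.log10(tilt_first * tilt_second)
```

For example, with low and high band energies of 103 and 13 and unit noise energies in both regions, the noise-corrected tilt is 100/10 = 10.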

The signal to noise ratio (SNR) measure exploits the fact that for a general waveform matching encoder, the SNR is much higher for voiced sounds. The snr parameter estimation must be done at the end of the encoder subframe loop and is computed in the SNR computation module 504 using the relation:

$snr = \frac{E_{sw}}{E_e} \qquad (9)$

where $E_{sw}$ is the energy of the weighted speech signal $s_w(n)$ of the current frame from the perceptual weighting filter 205 and $E_e$ is the energy of the error between this weighted speech signal and the weighted synthesis signal of the current frame from the perceptual weighting filter 205′.

The pitch stability counter pc assesses the variation of the pitch period. It is computed within the signal classification module 505 in response to the open-loop pitch estimates as follows:

$pc = |p_1 - p_0| + |p_2 - p_1| \qquad (10)$

The values p₀, p₁, p₂ correspond to the open-loop pitch estimates calculated by the open-loop pitch search module 206 from the first half of the current frame, the second half of the current frame and the look-ahead, respectively.

The relative frame energy $E_s$ is computed by module 500 as a difference between the current frame energy in dB and its long-term average

$E_s = \bar{E}_f - E_{lt}$

where the frame energy $\bar{E}_f$ is obtained as a summation of the critical band energies, averaged over both spectral analyses performed each frame:

$\bar{E}_f = 10 \log_{10}\bigl(0.5\,(E_f(0) + E_f(1))\bigr)$

$E_f(j) = \sum_{i=0}^{19} e(i)$

The long-term averaged energy is updated on active speech frames using the following relation:

$E_{lt} = 0.99\,E_{lt} + 0.01\,\bar{E}_f$
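As an illustration only, the relative frame energy and the update of its long-term average can be sketched as follows. The sketch assumes that the relative energy is evaluated before the long-term average is updated with the current frame; the description does not specify this order. The function names are hypothetical.

```python
import math


def frame_energy_db(bands_first, bands_second):
    """Frame energy in dB from the critical band energies of both analyses."""
    e_first = sum(bands_first)    # E_f(0)
    e_second = sum(bands_second)  # E_f(1)
    return 10.0 * math.log10(0.5 * (e_first + e_second))


def relative_energy(frame_db, long_term_db, active_speech):
    """Return (E_s, updated long-term average)."""
    e_s = frame_db - long_term_db
    if active_speech:
        # Slow first-order update, applied on active speech frames only.
        long_term_db = 0.99 * long_term_db + 0.01 * frame_db
    return e_s, long_term_db
```

With a forgetting factor of 0.99, a frame that is 10 dB above the long-term average moves that average by only 0.1 dB.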

The last parameter is the zero-crossing parameter zc computed on one frame of the speech signal by the zero-crossing computation module 508. The frame starts in the middle of the current frame and uses two (2) subframes of the look-ahead. In this illustrative embodiment, the zero-crossing counter zc counts the number of times the signal sign changes from positive to negative during that interval.
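As an illustration only, a counter of positive-to-negative sign changes can be sketched as follows. Treating a sample equal to zero as non-negative is an assumption of the sketch, and the function name is hypothetical.

```python
def zero_crossings(samples):
    """Count transitions from a non-negative sample to a negative sample."""
    count = 0
    for current, following in zip(samples, samples[1:]):
        if current >= 0 and following < 0:
            count += 1
    return count
```

Only downward crossings are counted, so a full cycle of a sinusoid contributes one count rather than two.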

To make the classification more robust, the classification parameters are considered together forming a function of merit $f_m$. For that purpose, the classification parameters are first scaled between 0 and 1 so that each parameter's value typical for unvoiced signal translates into 0 and each parameter's value typical for voiced signal translates into 1. A linear function is used between them. Let us consider a parameter $p_x$; its scaled version is obtained using:

$p^s = k_p \cdot p_x + c_p$

and clipped between 0 and 1. The function coefficients $k_p$ and $c_p$ have been found experimentally for each of the parameters so that the signal distortion due to the concealment and recovery techniques used in presence of FERs is minimal. The values used in this illustrative implementation are summarized in Table 2:

TABLE 2. Signal classification parameters and the coefficients of their respective scaling functions

Parameter      Meaning                    k_p         c_p
$\bar{r}_x$    Normalized Correlation     2.857       −1.286
$\bar{e}_t$    Spectral Tilt              0.04167     0
snr            Signal to Noise Ratio      0.1111      −0.3333
pc             Pitch Stability Counter    −0.07143    1.857
$E_s$          Relative Frame Energy      0.05        0.45
zc             Zero Crossing Counter      −0.04       2.4

The merit function has been defined as:

$f_m = \frac{1}{7}\left(2 \cdot \bar{r}_x^{\,s} + \bar{e}_t^{\,s} + snr^s + pc^s + E_s^s + zc^s\right)$

where the superscript s indicates the scaled version of the parameters.
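As an illustration only, the scaling of the six parameters with the coefficients of Table 2 and their combination into the merit function can be sketched as follows. The dictionary keys are hypothetical labels chosen for the example.

```python
# (k_p, c_p) per parameter, as listed in Table 2.
SCALING = {
    "rx": (2.857, -1.286),   # normalized correlation
    "et": (0.04167, 0.0),    # spectral tilt
    "snr": (0.1111, -0.3333),
    "pc": (-0.07143, 1.857),  # pitch stability counter
    "Es": (0.05, 0.45),      # relative frame energy
    "zc": (-0.04, 2.4),      # zero-crossing counter
}


def scale(name, value):
    """Linear scaling k_p * p_x + c_p, clipped between 0 and 1."""
    k_p, c_p = SCALING[name]
    return min(1.0, max(0.0, k_p * value + c_p))


def merit(params):
    """Merit function: the normalized correlation is counted twice."""
    s = {name: scale(name, value) for name, value in params.items()}
    total = 2.0 * s["rx"] + s["et"] + s["snr"] + s["pc"] + s["Es"] + s["zc"]
    return total / 7.0
```

Because every scaled parameter lies between 0 and 1 and the weights sum to 7, the merit function also lies between 0 and 1, with values near 1 indicating strongly voiced frames.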

The classification is then done using the merit function f_(m) and following the rules summarized in Table 3:

TABLE 3. Signal classification rules at the encoder

Previous Frame Class                Rule                    Current Frame Class
ONSET, VOICED, VOICED TRANSITION    f_(m) ≥ 0.66            VOICED
                                    0.66 > f_(m) ≥ 0.49     VOICED TRANSITION
                                    f_(m) < 0.49            UNVOICED
UNVOICED TRANSITION, UNVOICED       f_(m) > 0.63            ONSET
                                    0.63 ≥ f_(m) > 0.585    UNVOICED TRANSITION
                                    f_(m) ≤ 0.585           UNVOICED
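As an illustration only, the decision rules at the encoder can be sketched as follows. The sketch assumes that the boundary values 0.66 and 0.49 belong to the upper class and that the boundary values 0.63 and 0.585 belong to the lower class, as marked in the comments; the function name is hypothetical.

```python
VOICED_GROUP = ("ONSET", "VOICED", "VOICED TRANSITION")


def classify_at_encoder(f_m, previous_class):
    """Select the current frame class from the merit value and the previous class."""
    if previous_class in VOICED_GROUP:
        if f_m >= 0.66:          # boundary assigned to VOICED (assumption)
            return "VOICED"
        if f_m >= 0.49:          # boundary assigned to VOICED TRANSITION (assumption)
            return "VOICED TRANSITION"
        return "UNVOICED"
    # Previous class is UNVOICED or UNVOICED TRANSITION.
    if f_m > 0.63:
        return "ONSET"
    if f_m > 0.585:
        return "UNVOICED TRANSITION"
    return "UNVOICED"
```

The same merit value can therefore lead to different classes depending on the previous frame, for instance 0.7 gives VOICED after a VOICED frame but ONSET after an UNVOICED frame.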

In case of a source-controlled variable bit rate (VBR) encoder, a signal classification is inherent to the codec operation. The codec operates at several bit rates, and a rate selection module is used to determine the bit rate used for encoding each speech frame based on the nature of the speech frame (e.g. voiced, unvoiced, transient, background noise frames are each encoded with a special encoding algorithm). The information about the coding mode and thus about the speech class is already an implicit part of the bitstream and need not be explicitly transmitted for FER processing. This class information can be then used to overwrite the classification decision described above.

In the example application to the AMR WB codec, the only source-controlled rate selection represents the voice activity detection (VAD). This VAD flag equals 1 for active speech, 0 for silence. This parameter is useful for the classification as it directly indicates that no further classification is needed if its value is 0 (i.e. the frame is directly classified as UNVOICED). This parameter is the output of the voice activity detection (VAD) module 402. Different VAD algorithms exist in the literature and any algorithm can be used for the purpose of the present invention. For instance the VAD algorithm that is part of standard G.722.2 can be used [ITU-T Recommendation G.722.2 “Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)”, Geneva, 2002]. Here, the VAD algorithm is based on the output of the spectral analysis of module 500 (based on signal-to-noise ratio per critical band). The VAD used for the classification purpose differs from the one used for encoding purpose with respect to the hangover. In speech encoders using a comfort noise generation (CNG) for segments without active speech (silence or noise-only), a hangover is often added after speech spurts (CNG in AMR-WB standard is an example [3GPP TS 26.192, “AMR Wideband Speech Codec: Comfort Noise Aspects,” 3GPP Technical Specification]). During the hangover, the speech encoder continues to be used and the system switches to the CNG only after the hangover period is over. For the purpose of classification for FER concealment, this high security is not needed. Consequently, the VAD flag for the classification will be equal to 0 also during the hangover period.

In this illustrative embodiment, the classification is performed in module 505 based on the parameters described above; namely, normalized correlations (or voicing information) r_(x), spectral tilt e_(t), snr, pitch stability counter pc, relative frame energy E_(s), zero crossing rate zc, and VAD flag.

Classification at the Decoder

If the application does not permit the transmission of the class information (no extra bits can be transported), the classification can be still performed at the decoder. As already noted, the main disadvantage here is that there is generally no available look-ahead in speech decoders. Also, there is often a need to keep the decoder complexity limited.

A simple classification can be done by estimating the voicing of the synthesized signal. If we consider the case of a CELP type encoder, the voicing estimate $r_v$ computed as in Equation (1) can be used. That is:

$r_v = \frac{E_v - E_c}{E_v + E_c}$

where $E_v$ is the energy of the scaled pitch codevector $b v_T$ and $E_c$ is the energy of the scaled innovative codevector $g c_k$. Theoretically, for a purely voiced signal $r_v = 1$ and for a purely unvoiced signal $r_v = -1$. The actual classification is done by averaging $r_v$ values every 4 subframes. The resulting factor $f_{rv}$ (average of $r_v$ values of every four subframes) is used as follows:

TABLE 4. Signal classification rules at the decoder

Previous Frame Class                Rule                     Current Frame Class
ONSET, VOICED, VOICED TRANSITION    f_rv > −0.1              VOICED
                                    −0.1 ≥ f_rv ≥ −0.5       VOICED TRANSITION
                                    f_rv < −0.5              UNVOICED
UNVOICED TRANSITION, UNVOICED       f_rv > −0.1              ONSET
                                    −0.1 ≥ f_rv ≥ −0.5       UNVOICED TRANSITION
                                    f_rv < −0.5              UNVOICED
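As an illustration only, the voicing estimate and the decision rules at the decoder can be sketched as follows. The sketch assumes that the boundary values −0.1 and −0.5 both fall in the transition classes, and it returns 0 for the voicing estimate when both energies are zero, which is a choice made for the example. The function names are hypothetical.

```python
VOICED_GROUP = ("ONSET", "VOICED", "VOICED TRANSITION")


def voicing_estimate(e_v, e_c):
    """r_v = (E_v - E_c) / (E_v + E_c); 0 if both energies are zero (assumption)."""
    total = e_v + e_c
    return (e_v - e_c) / total if total > 0.0 else 0.0


def classify_at_decoder(r_v_subframes, previous_class):
    """Average r_v over the subframes of a frame and apply the decoder rules."""
    f_rv = sum(r_v_subframes) / len(r_v_subframes)
    voiced_history = previous_class in VOICED_GROUP
    if f_rv > -0.1:
        return "VOICED" if voiced_history else "ONSET"
    if f_rv >= -0.5:  # boundaries assigned to the transition classes (assumption)
        return "VOICED TRANSITION" if voiced_history else "UNVOICED TRANSITION"
    return "UNVOICED"
```

For example, four subframes with r_v = 0.5 following an UNVOICED frame are classified as ONSET, whereas the same values following a VOICED frame are classified as VOICED.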

Similarly to the classification at the encoder, other parameters can be used at the decoder to help the classification, such as the parameters of the LP filter or the pitch stability.

In the case of a source-controlled variable bit rate coder, the information about the coding mode is already a part of the bitstream. Hence, if for example a purely unvoiced coding mode is used, the frame can be automatically classified as UNVOICED. Similarly, if a purely voiced coding mode is used, the frame is classified as VOICED.

Speech Parameters for FER Processing

There are a few critical parameters that must be carefully controlled to avoid annoying artifacts when FERs occur. If a few extra bits can be transmitted, then these parameters can be estimated at the encoder, quantized, and transmitted. Otherwise, some of them can be estimated at the decoder. These parameters include signal classification, energy information, phase information, and voicing information. The most important is a precise control of the speech energy. The phase and the speech periodicity can be controlled too, to further improve the FER concealment and recovery.

The importance of the energy control manifests itself mainly when normal operation recovers after an erased block of frames. As most speech encoders make use of prediction, the right energy cannot be properly estimated at the decoder. In voiced speech segments, the incorrect energy can persist for several consecutive frames, which is very annoying, especially when this incorrect energy increases.

Even if the energy control is most important for voiced speech because of the long-term prediction (pitch prediction), it is also important for unvoiced speech. The reason here is the prediction of the innovation gain quantizer often used in CELP type coders. The wrong energy during unvoiced segments can cause an annoying high-frequency fluctuation.

The phase control can be done in several ways, mainly depending on the available bandwidth. In our implementation, a simple phase control is achieved during lost voiced onsets by searching for approximate information about the glottal pulse position.

Hence, apart from the signal classification information discussed in the previous section, the most important information to send is the information about the signal energy and the position of the first glottal pulse in a frame (phase information). If enough bandwidth is available, voicing information can be sent, too.

Energy Information

The energy information can be estimated and sent either in the LP residual domain or in the speech signal domain. Sending the information in the residual domain has the disadvantage of not taking into account the influence of the LP synthesis filter. This can be particularly tricky in the case of voiced recovery after several lost voiced frames (when the FER happens during a voiced speech segment). When a FER arrives after a voiced frame, the excitation of the last good frame is typically used during the concealment with some attenuation strategy. When a new LP synthesis filter arrives with the first good frame after the erasure, there can be a mismatch between the excitation energy and the gain of the LP synthesis filter. The new synthesis filter can produce a synthesis signal with an energy highly different from the energy of the last synthesized erased frame and also from the original signal energy. For this reason, the energy is computed and quantized in the signal domain.

The energy E_(q) is computed and quantized in energy estimation and quantization module 506. It has been found that 6 bits are sufficient to transmit the energy. However, the number of bits can be reduced without a significant effect if not enough bits are available. In this preferred embodiment, a 6-bit uniform quantizer is used in the range of −15 dB to 83 dB with a step of 1.58 dB. The quantization index is given by the integer part of:

$$i = \frac{10\log_{10}(E + 0.001) + 15}{1.58} \qquad (15)$$

where E is the maximum of the signal energy for frames classified as VOICED or ONSET, or the average energy per sample for other frames. For VOICED or ONSET frames, the maximum of the signal energy is computed pitch synchronously at the end of the frame as follows:

$$E = \max_{i = L - t_E}^{L - 1}\left(s^{2}(i)\right) \qquad (16)$$

where L is the frame length and signal s(i) stands for the speech signal (or the denoised speech signal if noise suppression is used). In this illustrative embodiment s(i) stands for the input signal after downsampling to 12.8 kHz and pre-processing. If the pitch delay is greater than 63 samples, t_(E) equals the rounded closed-loop pitch lag of the last subframe. If the pitch delay is shorter than 64 samples, then t_(E) is set to twice the rounded closed-loop pitch lag of the last subframe.

For other classes, E is the average energy per sample of the second half of the current frame, i.e. t_(E) is set to L/2 and E is computed as:

$$E = \frac{1}{t_E}\sum_{i = L - t_E}^{L - 1} s^{2}(i) \qquad (17)$$
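As a rough illustration of Equations (15) to (17), the following Python sketch computes E for one frame and derives the 6-bit index. The function names are hypothetical, and two details are assumptions not stated in the text: the analysis window is capped at the frame length, and the index is clipped to the 6-bit range 0 to 63.

```python
import math


def frame_energy(s, frame_class, pitch_lag):
    """Energy E of one frame s (list of samples), per Equations (16)/(17).

    pitch_lag is the rounded closed-loop pitch lag of the last subframe.
    """
    L = len(s)
    if frame_class in ("VOICED", "ONSET"):
        t_e = pitch_lag if pitch_lag > 63 else 2 * pitch_lag
        t_e = min(t_e, L)  # assumption: never look before the frame start
        return max(x * x for x in s[L - t_e:])
    t_e = L // 2
    return sum(x * x for x in s[L - t_e:]) / t_e


def quantize_energy(E):
    """Integer part of Equation (15); clipping to 6 bits is an assumption."""
    i = int((10.0 * math.log10(E + 0.001) + 15.0) / 1.58)
    return max(0, min(63, i))
```

With this sketch, an energy of 1.0 (about 0 dB) maps to index 9, since (0 + 15) / 1.58 is roughly 9.5.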

Phase Control Information

The phase control is particularly important while recovering after a lost segment of voiced speech, for reasons similar to those described in the previous section. After a block of erased frames, the decoder memories become desynchronized with the encoder memories. To resynchronize the decoder, some phase information can be sent depending on the available bandwidth. In the described illustrative implementation, a rough position of the first glottal pulse in the frame is sent. This information is then used for the recovery after lost voiced onsets, as will be described later.

Let T₀ be the rounded closed-loop pitch lag for the first subframe. First glottal pulse search and quantization module 507 searches for the position of the first glottal pulse τ among the first T₀ samples of the frame by looking for the sample with the maximum amplitude. Best results are obtained when the position of the first glottal pulse is measured on the low-pass filtered residual signal.

The position of the first glottal pulse is coded using 6 bits in the following manner. The precision used to encode the position of the first glottal pulse depends on the closed-loop pitch value for the first subframe T₀. This is possible because this value is known both by the encoder and the decoder, and is not subject to error propagation after one or several frame losses. When T₀ is less than 64, the position of the first glottal pulse relative to the beginning of the frame is encoded directly with a precision of one sample. When 64≤T₀<128, the position of the first glottal pulse relative to the beginning of the frame is encoded with a precision of two samples by using a simple integer division, i.e. τ/2. When T₀≥128, the position of the first glottal pulse relative to the beginning of the frame is encoded with a precision of four samples by further dividing τ by 2. The inverse procedure is done at the decoder. If T₀<64, the received quantized position is used as is. If 64≤T₀<128, the received quantized position is multiplied by 2 and incremented by 1. If T₀≥128, the received quantized position is multiplied by 4 and incremented by 2 (incrementing by 2 results in a uniformly distributed quantization error).

According to another embodiment of the invention where the shape of the first glottal pulse is encoded, the position of the first glottal pulse is determined by a correlation analysis between the residual signal and the possible pulse shapes, signs (positive or negative) and positions. The pulse shape can be taken from a codebook of pulse shapes known at both the encoder and the decoder, this method being known as vector quantization by those of ordinary skill in the art. The shape, sign and amplitude of the first glottal pulse are then encoded and transmitted to the decoder.
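The 6-bit position coding can be sketched as a pair of Python functions. The names are hypothetical, and the thresholds are taken as 64 ≤ T₀ < 128 and T₀ ≥ 128.

```python
def encode_pulse_position(tau, T0):
    """Quantize the first glottal pulse position tau (0 <= tau < T0)."""
    if T0 < 64:
        return tau        # one-sample precision
    if T0 < 128:
        return tau // 2   # two-sample precision
    return tau // 4       # four-sample precision


def decode_pulse_position(index, T0):
    """Inverse mapping used at the decoder; T0 is known from the bitstream."""
    if T0 < 64:
        return index
    if T0 < 128:
        return 2 * index + 1
    return 4 * index + 2
```

For instance, with T₀ = 210 and τ = 203 the transmitted index is 50 and the decoder reconstructs position 202, an error of one sample. In this sketch the reconstruction error never exceeds two samples.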

Periodicity Information

In case there is enough bandwidth, periodicity information, or voicing information, can be computed and transmitted, and used at the decoder to improve the frame erasure concealment. The voicing information is estimated based on the normalized correlation. It can be encoded quite precisely with 4 bits; however, 3 or even 2 bits would suffice if necessary. The voicing information is necessary in general only for frames with some periodic components, and better voicing resolution is needed for highly voiced frames. The normalized correlation is given in Equation (2) and it is used as an indicator of the voicing information. It is quantized in first glottal pulse search and quantization module 507. In this illustrative embodiment, a piece-wise linear quantizer has been used to encode the voicing information as follows:

$$i = \frac{r_x(2) - 0.65}{0.03} + 0.5, \quad \text{for } r_x(2) < 0.92 \qquad (18)$$

$$i = 9 + \frac{r_x(2) - 0.92}{0.01} + 0.5, \quad \text{for } r_x(2) \geq 0.92 \qquad (19)$$

Again, the integer part of i is encoded and transmitted. The correlation r_(x)(2) has the same meaning as in Equation (1). In Equation (18) the voicing is linearly quantized between 0.65 and 0.89 with a step of 0.03. In Equation (19) the voicing is linearly quantized between 0.92 and 0.98 with a step of 0.01.

If a larger quantization range is needed, the following linear quantization can be used:

$$i = \frac{\bar{r}_x - 0.4}{0.04} + 0.5 \qquad (20)$$

This equation quantizes the voicing in the range of 0.4 to 1 with a step of 0.04. The correlation r̄_(x) is defined in Equation (2a).

Equations (18) and (19), or Equation (20), are then used in the decoder to compute r_(x)(2) or r̄_(x). Let us call this quantized normalized correlation r_(q). If the voicing cannot be transmitted, it can be estimated using the voicing factor from Equation (2a) by mapping it into the range from 0 to 1:

$$r_q = 0.5 \cdot (f + 1) \qquad (21)$$
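A minimal Python sketch of the piece-wise linear quantizer of Equations (18) and (19) follows. The function names are hypothetical. Clipping the index to the 4-bit range 0 to 15 and the form of the inverse mapping are assumptions consistent with the stated ranges (0.65 to 0.89 in steps of 0.03, 0.92 to 0.98 in steps of 0.01).

```python
def quantize_voicing(r):
    """4-bit index for the normalized correlation r, Equations (18)/(19)."""
    if r < 0.92:
        i = int((r - 0.65) / 0.03 + 0.5)
    else:
        i = int(9 + (r - 0.92) / 0.01 + 0.5)
    return max(0, min(15, i))  # assumption: clip to 4 bits


def dequantize_voicing(i):
    """Assumed decoder-side inverse: reconstruct the quantized correlation."""
    if i < 9:
        return 0.65 + 0.03 * i
    return 0.92 + 0.01 * (i - 9)
```

Indices 0 to 8 cover the coarse range and indices 9 to 15 the fine range, which gives the finer resolution for highly voiced frames that the text calls for.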

Processing of Erased Frames

The FER concealment techniques in this illustrative embodiment are demonstrated on ACELP type encoders. They can, however, be easily applied to any speech codec where the synthesis signal is generated by filtering an excitation signal through an LP synthesis filter. The concealment strategy can be summarized as a convergence of the signal energy and the spectral envelope to the estimated parameters of the background noise. The periodicity of the signal converges to zero. The speed of the convergence depends on the class of the last good received frame and the number of consecutive erased frames, and is controlled by an attenuation factor α. The factor α further depends on the stability of the LP filter for UNVOICED frames. In general, the convergence is slow if the last good received frame is in a stable segment and is rapid if the frame is in a transition segment. The values of α are summarized in Table 5.

TABLE 5 Values of the FER concealment attenuation factor α

| Last Good Received Frame | Number of successive erased frames | α |
|---|---|---|
| ARTIFICIAL ONSET | | 0.6 |
| ONSET, VOICED | ≤ 3 | 1.0 |
| | > 3 | 0.4 |
| VOICED TRANSITION | | 0.4 |
| UNVOICED TRANSITION | | 0.8 |
| UNVOICED | = 1 | 0.6 θ + 0.4 |
| | > 1 | 0.4 |

A stability factor θ is computed based on a distance measure between the adjacent LP filters. Here, the factor θ is related to the ISF (Immittance Spectral Frequencies) distance measure and it is bounded by 0 ≤ θ ≤ 1, with larger values of θ corresponding to more stable signals. This results in decreasing energy and spectral envelope fluctuations when an isolated frame erasure occurs inside a stable unvoiced segment.

The signal class remains unchanged during the processing of erased frames, i.e. the class remains the same as in the last good received frame.
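Table 5 amounts to a small lookup, sketched below in Python. The function name and string labels are hypothetical, and the ONSET/VOICED threshold is read as "at most 3 successive erased frames".

```python
def attenuation_factor(last_good_class, n_erased, theta=1.0):
    """Attenuation factor alpha following Table 5.

    n_erased is the number of successive erased frames so far (>= 1);
    theta is the LP filter stability factor, 0 <= theta <= 1.
    """
    if last_good_class == "ARTIFICIAL ONSET":
        return 0.6
    if last_good_class in ("ONSET", "VOICED"):
        return 1.0 if n_erased <= 3 else 0.4
    if last_good_class == "VOICED TRANSITION":
        return 0.4
    if last_good_class == "UNVOICED TRANSITION":
        return 0.8
    if last_good_class == "UNVOICED":
        return 0.6 * theta + 0.4 if n_erased == 1 else 0.4
    raise ValueError("unknown frame class: %r" % (last_good_class,))
```

For an isolated erasure inside a perfectly stable unvoiced segment (θ = 1) the factor is 1.0, so the energy and the spectral envelope are left essentially unchanged.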

Construction of the Periodic Part of the Excitation

For a concealment of erased frames following a correctly received UNVOICED frame, no periodic part of the excitation signal is generated. For a concealment of erased frames following a correctly received frame other than UNVOICED, the periodic part of the excitation signal is constructed by repeating the last pitch period of the previous frame. In the case of the first erased frame after a good frame, this pitch pulse is first low-pass filtered. The filter used is a simple 3-tap linear phase FIR filter with filter coefficients equal to 0.18, 0.64 and 0.18. If voicing information is available, the filter can also be selected dynamically with a cut-off frequency dependent on the voicing.

The pitch period T_(c) used to select the last pitch pulse, and hence used during the concealment, is defined so that pitch multiples or submultiples can be avoided or reduced. The following logic is used in determining the pitch period T_(c):

if ((T₃ < 1.8 T_(s)) AND (T₃ > 0.6 T_(s))) OR (T_(cnt) ≥ 30), then T_(c) = T₃, else T_(c) = T_(s).

Here, T₃ is the rounded pitch period of the 4^(th) subframe of the last good received frame and T_(s) is the rounded pitch period of the 4^(th) subframe of the last good stable voiced frame with coherent pitch estimates. A stable voiced frame is defined here as a VOICED frame preceded by a frame of voiced type (VOICED TRANSITION, VOICED, ONSET). The coherence of pitch is verified in this implementation by examining whether the closed-loop pitch estimates are reasonably close, i.e. whether the ratios between the last subframe pitch, the 2nd subframe pitch and the last subframe pitch of the previous frame are within the interval (0.7, 1.4).

This determination of the pitch period T_(c) means that if the pitch at the end of the last good frame and the pitch of the last stable frame are close to each other, the pitch of the last good frame is used. Otherwise this pitch is considered unreliable and the pitch of the last stable frame is used instead, to avoid the impact of wrong pitch estimates at voiced onsets. This logic, however, makes sense only if the last stable segment is not too far in the past. Hence a counter T_(cnt) is defined that limits the reach of the influence of the last stable segment. If T_(cnt) is greater than or equal to 30, i.e. if there have been at least 30 frames since the last T_(s) update, the last good frame pitch is used systematically. T_(cnt) is reset to 0 every time a stable segment is detected and T_(s) is updated. The period T_(c) is then kept constant during the concealment for the whole erased block.

As the last pulse of the excitation of the previous frame is used for the construction of the periodic part, its gain is approximately correct at the beginning of the concealed frame and can be set to 1. The gain is then attenuated linearly throughout the frame on a sample by sample basis to reach the value of α at the end of the frame.

The values of α correspond to Table 5, with the exception that they are modified for erasures following VOICED and ONSET frames to take into consideration the energy evolution of voiced segments. This evolution can be extrapolated to some extent by using the pitch excitation gain values of each subframe of the last good frame. In general, if these gains are greater than 1, the signal energy is increasing; if they are lower than 1, the energy is decreasing. α is thus multiplied by a correction factor f_(b) computed as follows:

$$f_b = \sqrt{0.1\,b(0) + 0.2\,b(1) + 0.3\,b(2) + 0.4\,b(3)} \qquad (23)$$

where b(0), b(1), b(2) and b(3) are the pitch gains of the four subframes of the last correctly received frame. The value of f_(b) is clipped to the range 0.85 to 0.98 before being used to scale the periodic part of the excitation. In this way, strong energy increases and decreases are avoided.

For erased frames following a correctly received frame other than UNVOICED, the excitation buffer is updated with this periodic part of the excitation only. This update will be used to construct the pitch codebook excitation in the next frame.
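The selection of the concealment pitch period can be sketched as follows. The function name is hypothetical, and the counter condition is taken as T_(cnt) ≥ 30, as stated in the paragraph above.

```python
def concealment_pitch(T3, Ts, T_cnt):
    """Choose the pitch period T_c used for the whole erased block.

    T3:    rounded pitch of the 4th subframe of the last good frame
    Ts:    rounded pitch of the 4th subframe of the last stable voiced frame
    T_cnt: number of frames since Ts was last updated
    """
    close_to_stable = 0.6 * Ts < T3 < 1.8 * Ts
    if close_to_stable or T_cnt >= 30:
        return T3
    return Ts
```

With T_(s) = 58, a last good frame pitch of 120 is rejected as a likely pitch multiple (120 > 1.8 × 58) unless the stable segment is at least 30 frames old.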
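The correction factor of Equation (23) and the sample-by-sample gain ramp can be sketched as below. The function names are hypothetical. Two points are assumptions: negative weighted sums are floored at zero before the square root, and the ramp is defined so that the first sample has gain 1 and the last sample has gain α.

```python
import math


def pitch_gain_correction(b):
    """f_b of Equation (23) from the four subframe pitch gains b[0..3]."""
    weighted = 0.1 * b[0] + 0.2 * b[1] + 0.3 * b[2] + 0.4 * b[3]
    f_b = math.sqrt(max(0.0, weighted))  # assumption: guard negative sums
    return min(0.98, max(0.85, f_b))


def periodic_gain_ramp(alpha, frame_len):
    """Linear gain from 1.0 (first sample) down to alpha (last sample)."""
    return [1.0 + (alpha - 1.0) * i / (frame_len - 1)
            for i in range(frame_len)]
```

For a VOICED or ONSET last good frame, the value passed as `alpha` would be the Table 5 value multiplied by `pitch_gain_correction(b)`.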

Construction of the Random Part of the Excitation

The innovation (non-periodic) part of the excitation signal is generated randomly. It can be generated as random noise or by using the CELP innovation codebook with vector indexes generated randomly. In the present illustrative embodiment, a simple random generator with approximately uniform distribution has been used. Before adjusting the innovation gain, the randomly generated innovation is scaled to some reference value, fixed here to unit energy per sample.

At the beginning of an erased block, the innovation gain g_(s) is initialized by using the innovation excitation gains of each subframe of the last good frame:

$$g_s = 0.1\,g(0) + 0.2\,g(1) + 0.3\,g(2) + 0.4\,g(3) \qquad (23a)$$

where g(0), g(1), g(2) and g(3) are the fixed codebook, or innovation, gains of the four (4) subframes of the last correctly received frame. The attenuation strategy of the random part of the excitation is somewhat different from the attenuation of the pitch excitation. The reason is that the pitch excitation (and thus the excitation periodicity) converges to 0, while the random excitation converges to the comfort noise generation (CNG) excitation energy. The innovation gain attenuation is done as:

$$g_s^{1} = \alpha \cdot g_s^{0} + (1 - \alpha) \cdot g_n \qquad (24)$$

where g_(s)¹ is the innovation gain at the beginning of the next frame, g_(s)⁰ is the innovation gain at the beginning of the current frame, g_(n) is the gain of the excitation used during the comfort noise generation and α is as defined in Table 5. Similarly to the periodic excitation attenuation, the gain is thus attenuated linearly throughout the frame on a sample by sample basis, starting with g_(s)⁰ and going to the value of g_(s)¹ that would be reached at the beginning of the next frame.

Finally, if the last good (correctly received or non erased) frame is different from UNVOICED, the innovation excitation is filtered through a linear phase FIR high-pass filter with coefficients −0.0125, −0.109, 0.7813, −0.109, −0.0125. To decrease the amount of noisy components during voiced segments, these filter coefficients are multiplied by an adaptive factor equal to (0.75 − 0.25 r_(v)), r_(v) being the voicing factor as defined in Equation (1). The random part of the excitation is then added to the adaptive excitation to form the total excitation signal.

If the last good frame is UNVOICED, only the innovation excitation is used and it is further attenuated by a factor of 0.8. In this case, the past excitation buffer is updated with the innovation excitation, as no periodic part of the excitation is available.
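Equations (23a) and (24) and the per-sample interpolation can be sketched in Python as follows. The function names are hypothetical. The ramp is built so that sample index L (the first sample of the next frame) would carry the gain g_(s)¹.

```python
def initial_innovation_gain(g):
    """Equation (23a): weighted mean of the last good frame's four gains."""
    return 0.1 * g[0] + 0.2 * g[1] + 0.3 * g[2] + 0.4 * g[3]


def next_innovation_gain(g_s0, alpha, g_n):
    """Equation (24): move the gain towards the comfort noise gain g_n."""
    return alpha * g_s0 + (1.0 - alpha) * g_n


def innovation_gain_ramp(g_s0, g_s1, frame_len):
    """Per-sample gains going linearly from g_s0 towards g_s1."""
    return [g_s0 + (g_s1 - g_s0) * i / frame_len for i in range(frame_len)]
```

Applying `next_innovation_gain` frame after frame with α < 1 makes the gain converge to g_(n), which matches the stated convergence to the comfort noise excitation energy.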

Spectral Envelope Concealment, Synthesis and Updates

To synthesize the decoded speech, the LP filter parameters must be obtained. The spectral envelope is gradually moved to the estimated envelope of the ambient noise. Here the ISF representation of LP parameters is used:

$$l^{1}(j) = \alpha\, l^{0}(j) + (1 - \alpha)\, l_{n}(j), \quad j = 0, \ldots, p - 1 \qquad (25)$$

In Equation (25), l¹(j) is the value of the j^(th) ISF of the current frame, l⁰(j) is the value of the j^(th) ISF of the previous frame, l_(n)(j) is the value of the j^(th) ISF of the estimated comfort noise envelope and p is the order of the LP filter.

The synthesized speech is obtained by filtering the excitation signal through the LP synthesis filter. The filter coefficients are computed from the ISF representation and are interpolated for each subframe (four (4) times per frame) as during normal encoder operation.

As the innovation gain quantizer and the ISF quantizer both use prediction, their memories will not be up to date after normal operation is resumed. To reduce this effect, the quantizers' memories are estimated and updated at the end of each erased frame.
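Equation (25) is a per-coefficient interpolation between the previous ISF vector and the comfort noise ISF vector. A minimal Python sketch, with a hypothetical function name:

```python
def conceal_isf(isf_prev, isf_noise, alpha):
    """Equation (25): ISFs of the concealed frame.

    isf_prev:  ISF vector of the previous frame, l0(j)
    isf_noise: ISF vector of the estimated comfort noise envelope, ln(j)
    """
    return [alpha * l0 + (1.0 - alpha) * ln
            for l0, ln in zip(isf_prev, isf_noise)]
```

With α = 1 the previous envelope is repeated unchanged, and smaller values of α pull the envelope more quickly towards the noise estimate.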

Recovery of the Normal Operation After Erasure

The problem of the recovery after an erased block of frames is basically due to the strong prediction used in practically all modern speech encoders. In particular, the CELP type speech coders achieve their high signal-to-noise ratio for voiced speech because they use the past excitation signal to encode the present frame excitation (long-term or pitch prediction). Also, most of the quantizers (LP quantizers, gain quantizers) make use of prediction.

Artificial Onset Construction

The most complicated situation related to the use of long-term prediction in CELP encoders is when a voiced onset is lost. A lost onset means that the voiced speech onset happened somewhere during the erased block. In this case, the last good received frame was unvoiced and thus no periodic excitation is found in the excitation buffer. The first good frame after the erased block is, however, voiced; the excitation buffer at the encoder is highly periodic and the adaptive excitation has been encoded using this periodic past excitation. As this periodic part of the excitation is completely missing at the decoder, it can take up to several frames to recover from this loss.

If an ONSET frame is lost (i.e. a VOICED good frame arrives after an erasure, but the last good frame before the erasure was UNVOICED, as shown in FIG. 6), a special technique is used to artificially reconstruct the lost onset and to trigger the voiced synthesis. At the beginning of the first good frame after a lost onset, the periodic part of the excitation is constructed artificially as a low-pass filtered periodic train of pulses separated by a pitch period. In the present illustrative embodiment, the low-pass filter is a simple linear phase FIR filter with the impulse response h_(low)={−0.0125, 0.109, 0.7813, 0.109, −0.0125}. However, the filter could also be selected dynamically with a cut-off frequency corresponding to the voicing information if this information is available. The innovative part of the excitation is constructed using normal CELP decoding. The entries of the innovation codebook could also be chosen randomly (or the innovation itself could be generated randomly), as the synchrony with the original signal has been lost anyway.

In practice, the length of the artificial onset is limited so that at least one entire pitch period is constructed by this method, and the method is continued to the end of the current subframe. After that, regular ACELP processing is resumed. The pitch period considered is the rounded average of the decoded pitch periods of all subframes where the artificial onset reconstruction is used. The low-pass filtered impulse train is realized by placing the impulse responses of the low-pass filter in the adaptive excitation buffer (previously initialized to zero). The first impulse response is centered at the quantized position τ_(q) (transmitted within the bitstream) with respect to the frame beginning, and the remaining impulses are placed at a distance of the averaged pitch from one another, up to the end of the last subframe affected by the artificial onset construction. If the available bandwidth is not sufficient to transmit the first glottal pulse position, the first impulse response can be placed arbitrarily around half of the pitch period after the current frame beginning.

As an example, for a subframe length of 64 samples, let us consider that the pitch periods in the first and the second subframe are p(0)=70.75 and p(1)=71. Since this is larger than the subframe size of 64, the artificial onset is constructed during the first two subframes and the pitch period is equal to the pitch average of the two subframes rounded to the nearest integer, i.e. 71. The last two subframes are processed by the normal CELP decoder.

The energy of the periodic part of the artificial onset excitation is then scaled by the gain corresponding to the quantized and transmitted energy for FER concealment (as defined in Equations 16 and 17) and divided by the gain of the LP synthesis filter. The LP synthesis filter gain is computed as:

$$g_{LP} = \sqrt{\sum_{i = 0}^{63} h^{2}(i)} \qquad (31)$$

where h(i) is the LP synthesis filter impulse response. Finally, the artificial onset gain is reduced by multiplying the periodic part by 0.96. Alternatively, this value could correspond to the voicing if bandwidth were available to also transmit the voicing information. Alternatively, without departing from the essence of this invention, the artificial onset can also be constructed in the past excitation buffer before entering the decoder subframe loop. This would have the advantage of avoiding the special processing to construct the periodic part of the artificial onset, and the regular CELP decoding could be used instead.

The LP filter for the output speech synthesis is not interpolated in the case of an artificial onset construction. Instead, the received LP parameters are used for the synthesis of the whole frame.
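The construction of the low-pass filtered pulse train can be sketched in Python using the example above. The function name is hypothetical, and the caller is assumed to pass the decoded pitch lags of exactly those subframes that the artificial onset covers. Energy scaling is left out of this sketch.

```python
H_LOW = [-0.0125, 0.109, 0.7813, 0.109, -0.0125]  # impulse response h_low


def artificial_onset(pitch_lags, first_pos, subframe_len=64):
    """Build the periodic part of the excitation for a lost onset.

    pitch_lags: decoded pitch periods of the subframes covered by the onset
    first_pos:  decoded position of the first glottal pulse in the frame
    Returns (excitation, rounded average pitch).
    """
    length = len(pitch_lags) * subframe_len
    pitch = int(sum(pitch_lags) / len(pitch_lags) + 0.5)
    if pitch < 1:
        raise ValueError("pitch period must be positive")
    exc = [0.0] * length          # adaptive excitation buffer, zeroed
    half = len(H_LOW) // 2
    pos = first_pos
    while pos < length:
        # Center one copy of the filter impulse response on the pulse.
        for k, h in enumerate(H_LOW):
            n = pos + k - half
            if 0 <= n < length:
                exc[n] += h
        pos += pitch
    return exc, pitch
```

Calling `artificial_onset([70.75, 71], 10)` gives a 128-sample buffer with a rounded pitch of 71 and pulses centered at samples 10 and 81.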
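The LP synthesis filter gain of Equation (31) can be sketched as follows. The function names are hypothetical, and the coefficient convention is an assumption: `a` holds the coefficients of A(z) = 1 + a₁z⁻¹ + ... + a_p z⁻ᵖ, with the synthesis filter being 1/A(z).

```python
import math


def lp_impulse_response(a, n=64):
    """First n samples of the impulse response of 1/A(z).

    a = [1.0, a1, ..., ap] (assumed sign convention, see lead-in).
    """
    h = []
    for i in range(n):
        acc = 1.0 if i == 0 else 0.0
        for k in range(1, len(a)):
            if i - k >= 0:
                acc -= a[k] * h[i - k]
        h.append(acc)
    return h


def lp_synthesis_gain(h):
    """Equation (31): root of the energy of the first 64 samples of h."""
    return math.sqrt(sum(x * x for x in h[:64]))
```

For a first-order filter with a₁ = −0.5 the impulse response is 0.5ⁱ, and the gain is very close to the square root of 4/3.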

Energy Control

The most important task in the recovery after an erased block of frames is to properly control the energy of the synthesized speech signal. The synthesis energy control is needed because of the strong prediction usually used in modern speech coders. The energy control is most important when a block of erased frames happens during a voiced segment. When a frame erasure arrives after a voiced frame, the excitation of the last good frame is typically used during the concealment with some attenuation strategy. When a new LP filter arrives with the first good frame after the erasure, there can be a mismatch between the excitation energy and the gain of the new LP synthesis filter. The new synthesis filter can produce a synthesis signal with an energy highly different from the energy of the last synthesized erased frame and also from the original signal energy.

The energy control during the first good frame after an erased frame can be summarized as follows. The synthesized signal is scaled so that, at the beginning of the first good frame, its energy is similar to the energy of the synthesized speech signal at the end of the last erased frame, and so that it converges to the transmitted energy towards the end of the frame, while preventing an excessive energy increase.

The energy control is done in the synthesized speech signal domain. Even if the energy is controlled in the speech domain, the excitation signal must be scaled, as it serves as long-term prediction memory for the following frames. The synthesis is then redone to smooth the transitions. Let g₀ denote the gain used to scale the first sample in the current frame and g₁ the gain used at the end of the frame. The excitation signal is then scaled as follows:

$$u_s(i) = g_{AGC}(i) \cdot u(i), \quad i = 0, \ldots, L - 1 \qquad (32)$$

where u_(s)(i) is the scaled excitation, u(i) is the excitation before the scaling, L is the frame length and g_(AGC)(i) is the gain starting from g₀ and converging exponentially to g₁:

$$g_{AGC}(i) = f_{AGC}\, g_{AGC}(i - 1) + (1 - f_{AGC})\, g_1, \quad i = 0, \ldots, L - 1$$

with the initialization g_(AGC)(−1)=g₀, where f_(AGC) is the attenuation factor, set in this implementation to the value of 0.98. This value has been found experimentally as a compromise between having a smooth transition from the previous (erased) frame on one side, and scaling the last pitch period of the current frame as much as possible to the correct (transmitted) value on the other side. This is important because the transmitted energy value is estimated pitch synchronously at the end of the frame. The gains g₀ and g₁ are defined as:

$$g_0 = \sqrt{E_{-1}/E_0} \qquad (33a)$$

$$g_1 = \sqrt{E_q/E_1} \qquad (33b)$$

where E⁻¹ is the energy computed at the end of the previous (erased) frame, E₀ is the energy at the beginning of the current (recovered) frame, E₁ is the energy at the end of the current frame and E_(q) is the quantized transmitted energy information at the end of the current frame, computed at the encoder from Equations (16, 17). E⁻¹ and E₁ are computed similarly, with the exception that they are computed on the synthesized speech signal s′. E⁻¹ is computed pitch synchronously using the concealment pitch period T_(c) and E₁ uses the last subframe rounded pitch T₃. E₀ is computed similarly using the rounded pitch value T₀ of the first subframe, Equations (16, 17) being modified to:

$$E = \max_{i = 0}^{t_E}\left(s'^{2}(i)\right)$$

for VOICED and ONSET frames. t_(E) equals the rounded pitch lag, or twice that length if the pitch is shorter than 64 samples. For other frames,

$$E = \frac{1}{t_E}\sum_{i = 0}^{t_E} s'^{2}(i)$$

with t_(E) equal to half of the frame length. The gains g₀ and g₁ are further limited to a maximum allowed value, to prevent a strong energy increase. This value has been set to 1.2 in the present illustrative implementation.
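The gain trajectory g_(AGC)(i) and the excitation scaling of Equation (32) can be sketched in Python as below. The function names are hypothetical. The recursion uses the factor (1 − f_(AGC)) on g₁, with f_(AGC) = 0.98 and the gain limit of 1.2 taken from the text.

```python
def agc_gains(g0, g1, frame_len, f_agc=0.98, g_max=1.2):
    """Per-sample gains starting near g0 and converging towards g1."""
    g0 = min(g0, g_max)
    g1 = min(g1, g_max)
    gains = []
    g = g0                      # plays the role of g_AGC(-1)
    for _ in range(frame_len):
        g = f_agc * g + (1.0 - f_agc) * g1
        gains.append(g)
    return gains


def scale_excitation(u, g0, g1):
    """Equation (32): apply the AGC gains to the excitation samples u."""
    return [g * x for g, x in zip(agc_gains(g0, g1, len(u)), u)]
```

With g₀ = 0.5, g₁ = 1.0 and a 256-sample frame, the first gain is 0.51 and the last is within 1% of g₁, so the final pitch period is scaled almost entirely by the transmitted energy.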

If E_(q) cannot be transmitted, E_(q) is set to E₁. If, however, the erasure happens during a voiced speech segment (i.e. the last good frame before the erasure and the first good frame after the erasure are classified as VOICED TRANSITION, VOICED or ONSET), further precautions must be taken because of the possible mismatch between the excitation signal energy and the LP filter gain, mentioned previously. A particularly dangerous situation arises when the gain of the LP filter of a first non erased frame received following frame erasure is higher than the gain of the LP filter of a last frame erased during that frame erasure. In that particular case, the energy of the LP filter excitation signal produced in the decoder during the received first non erased frame is adjusted to a gain of the LP filter of the received first non erased frame using the following relation:

$$E_q = E_1 \frac{E_{LP0}}{E_{LP1}}$$

where E_(LP0) is the energy of the LP filter impulse response of the last good frame before the erasure and E_(LP1) is the energy of the LP filter of the first good frame after the erasure. In this implementation, the LP filters of the last subframes in a frame are used. Finally, the value of E_(q) is limited to the value of E⁻¹ in this case (voiced segment erasure without E_(q) information being transmitted).

The following exceptions, all related to transitions in the speech signal, further override the computation of g₀. If artificial onset is used in the current frame, g₀ is set to 0.5 g₁, to make the onset energy increase gradually.

In the case of a first good frame after an erasure classified as ONSET, the gain g₀ is prevented from being higher than g₁. This precaution is taken to prevent a positive gain adjustment at the beginning of the frame (which is probably still at least partially unvoiced) from amplifying the voiced onset (at the end of the frame).

Finally, during a transition from voiced to unvoiced (i.e. the last good frame being classified as VOICED TRANSITION, VOICED or ONSET and the current frame being classified as UNVOICED) or during a transition from a non-active speech period to an active speech period (the last good received frame being encoded as comfort noise and the current frame being encoded as active speech), g₀ is set to g₁.

In the case of a voiced segment erasure, the wrong energy problem can also manifest itself in frames following the first good frame after the erasure. This can happen even if the first good frame's energy has been adjusted as described above. To attenuate this problem, the energy control can be continued up to the end of the voiced segment.

Although the present invention has been described in the foregoing description in relation to an illustrative embodiment thereof, this illustrative embodiment can be modified at will, within the scope of the appended claims, without departing from the scope and spirit of the subject invention.
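The fallback for a missing E_(q) can be sketched as a small Python function. The function and parameter names are hypothetical, and the comparison of LP filter gains is expressed here through the impulse response energies E_(LP0) and E_(LP1), which is an assumption about how the condition would be tested in practice.

```python
def fallback_target_energy(E1, E_prev, E_lp0, E_lp1, voiced_segment):
    """Target energy E_q when it was not transmitted.

    E1:     energy at the end of the current (first good) frame
    E_prev: energy at the end of the previous (erased) frame, E_-1
    E_lp0:  LP impulse response energy before the erasure
    E_lp1:  LP impulse response energy of the first good frame
    voiced_segment: True if the erasure occurred inside a voiced segment
    """
    if not voiced_segment:
        return E1
    E_q = E1
    if E_lp1 > E_lp0:
        # New LP filter has a higher gain: compensate the excitation energy.
        E_q = E1 * E_lp0 / E_lp1
    # Voiced segment erasure without transmitted E_q: limit to E_-1.
    return min(E_q, E_prev)
```

For example, with E₁ = 4.0 and an LP filter whose energy doubles across the erasure, the target energy drops to 2.0, and it is further capped by E⁻¹ if that value is lower.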
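The three exceptions that replace the computed g₀ can be gathered into one Python function. The function name, the keyword parameters and the string labels are hypothetical. The text does not state an order of precedence between the rules, so the order used here is an assumption.

```python
VOICED_TYPES = ("VOICED TRANSITION", "VOICED", "ONSET")


def override_g0(g0, g1, current_class, last_good_class,
                artificial_onset=False, last_was_comfort_noise=False):
    """Apply the transition-related exceptions to the initial gain g0."""
    if artificial_onset:
        return 0.5 * g1          # let the onset energy rise gradually
    if last_good_class in VOICED_TYPES and current_class == "UNVOICED":
        return g1                # voiced to unvoiced transition
    if last_was_comfort_noise:
        return g1                # non-active to active speech
    if current_class == "ONSET":
        return min(g0, g1)       # never boost the start of an onset frame
    return g0
```

In the remaining cases the value of g₀ from Equation (33a) is kept as computed.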

1. A method of concealing frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: determining, in the encoder, concealment/recovery parameters; transmitting to the decoder concealment/recovery parameters determined in the encoder; and in the decoder, conducting frame erasure concealment and decoder recovery in response to the received concealment/recovery parameters.
 2. A method as defined in claim 1, further comprising quantizing, in the encoder, the concealment/recovery parameters prior to transmitting said concealment/recovery parameters to the decoder.
 3. A method as defined in claim 1, wherein the concealment/recovery parameters are selected from the group consisting of: a signal classification parameter, an energy information parameter and a phase information parameter.
 4. A method as defined in claim 3, wherein determination of the phase information parameter comprises determining a position of a first glottal pulse in a frame of the encoded sound signal.
 5. A method as defined in claim 1, wherein conducting frame erasure concealment and decoder recovery comprises conducting decoder recovery in response to a determined position of a first glottal pulse after at least one lost voice onset.
 6. A method as defined in claim 1, wherein conducting frame erasure concealment and decoder recovery comprises, when at least one onset frame is lost, constructing a periodic excitation part artificially as a low-pass filtered periodic train of pulses separated by a pitch period.
 7. A method as defined in claim 6, wherein: the method comprises quantizing the position of the first glottal pulse prior to transmission of said position of the first glottal pulse to the decoder; and constructing a periodic excitation part comprises realizing the low-pass filtered periodic train of pulses by: centering a first impulse response of a low-pass filter on the quantized position of the first glottal pulse with respect to the beginning of a frame; and placing remaining impulse responses of the low-pass filter each with a distance corresponding to an average pitch value from the preceding impulse response up to the end of a last subframe affected by the artificial construction.
 8. A method as defined in claim 4, wherein determination of the phase information parameter further comprises encoding, in the encoder, the shape, sign and amplitude of the first glottal pulse and transmitting the encoded shape, sign and amplitude from the encoder to the decoder.
 9. A method as defined in claim 4, wherein determining the position of the first glottal pulse comprises: measuring the first glottal pulse as a sample of maximum amplitude within a pitch period; and quantizing the position of the sample of maximum amplitude within the pitch period.
 10. A method as defined in claim 1, wherein: the sound signal is a speech signal; and determination, in the encoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset.
 11. A method as defined in claim 10, wherein classifying the successive frames comprises classifying as unvoiced every frame which is an unvoiced frame, every frame without active speech, and every voiced offset frame having an end tending to be unvoiced.
 12. A method as defined in claim 10, wherein classifying the successive frames comprises classifying as unvoiced transition every unvoiced frame having an end with a possible voiced onset which is too short or not built well enough to be processed as a voiced frame.
 13. A method as defined in claim 10, wherein classifying the successive frames comprises classifying as voiced transition every voiced frame with relatively weak voiced characteristics, including voiced frames with rapidly changing characteristics and voiced offsets lasting the whole frame, wherein a frame classified as voiced transition follows only frames classified as voiced transition, voiced or onset.
 14. A method as defined in claim 10, wherein classifying the successive frames comprises classifying as voiced every voiced frame with stable characteristics, wherein a frame classified as voiced follows only frames classified as voiced transition, voiced or onset.
 15. A method as defined in claim 10, wherein classifying the successive frames comprises classifying as onset every voiced frame with stable characteristics following a frame classified as unvoiced or unvoiced transition.
 16. A method as defined in claim 10, comprising determining the classification of the successive frames of the encoded sound signal on the basis of at least a part of the following parameters: a normalized correlation parameter, a spectral tilt parameter, a signal-to-noise ratio parameter, a pitch stability parameter, a relative frame energy parameter, and a zero crossing parameter.
 17. A method as defined in claim 16, wherein determining the classification of the successive frames comprises: computing a figure of merit on the basis of the normalized correlation parameter, spectral tilt parameter, signal-to-noise ratio parameter, pitch stability parameter, relative frame energy parameter, and zero crossing parameter; and comparing the figure of merit to thresholds to determine the classification.
 18. A method as defined in claim 16, comprising calculating the normalized correlation parameter on the basis of a current weighted version of the speech signal and a past weighted version of said speech signal.
 19. A method as defined in claim 16, comprising estimating the spectral tilt parameter as a ratio between an energy concentrated in low frequencies and an energy concentrated in high frequencies.
 20. A method as defined in claim 16, comprising estimating the signal-to-noise ratio parameter as a ratio between an energy of a weighted version of the speech signal of a current frame and an energy of an error between said weighted version of the speech signal of the current frame and a weighted version of a synthesized speech signal of said current frame.
 21. A method as defined in claim 16, comprising computing the pitch stability parameter in response to open-loop pitch estimates for a first half of a current frame, a second half of the current frame and a look-ahead.
 22. A method as defined in claim 16, comprising computing the relative frame energy parameter as a difference between an energy of a current frame and a long-term average of an energy of active speech frames.
 23. A method as defined in claim 16, comprising determining the zero-crossing parameter as a number of times a sign of the speech signal changes from a first polarity to a second polarity.
 24. A method as defined in claim 16, comprising computing at least one of the normalized correlation parameter, spectral tilt parameter, signal-to-noise ratio parameter, pitch stability parameter, relative frame energy parameter, and zero crossing parameter using an available look-ahead to take into consideration the behavior of the speech signal in the following frame.
 25. A method as defined in claim 16, further comprising determining the classification of the successive frames of the encoded sound signal also on the basis of a voice activity detection flag.
 26. A method as defined in claim 3, wherein: the sound signal is a speech signal; determination, in the encoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and determining concealment/recovery parameters comprises calculating the energy information parameter in relation to a maximum of a signal energy for frames classified as voiced or onset, and calculating the energy information parameter in relation to an average energy per sample for other frames.
 27. A method as defined in claim 1, wherein determining, in the encoder, concealment/recovery parameters comprises computing a voicing information parameter.
 28. A method as defined in claim 27, wherein: the sound signal is a speech signal; determination, in the encoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal; said method comprises determining the classification of the successive frames of the encoded sound signal on the basis of a normalized correlation parameter; and computing the voicing information parameter comprises estimating said voicing information parameter on the basis of the normalized correlation.
 29. A method as defined in claim 1, wherein conducting frame erasure concealment and decoder recovery comprises: following receiving a non erased unvoiced frame after frame erasure, generating no periodic part of a LP filter excitation signal; following receiving, after frame erasure, of a non erased frame other than unvoiced, constructing a periodic part of the LP filter excitation signal by repeating a last pitch period of a previous frame.
 30. A method as defined in claim 29, wherein constructing the periodic part of the LP filter excitation signal comprises filtering the repeated last pitch period of the previous frame through a low-pass filter.
 31. A method as defined in claim 30, wherein: determining concealment/recovery parameters comprises computing a voicing information parameter; the low-pass filter has a cut-off frequency; and constructing the periodic part of the excitation signal comprises dynamically adjusting the cut-off frequency in relation to the voicing information parameter.
 32. A method as defined in claim 1, wherein conducting frame erasure concealment and decoder recovery comprises randomly generating a non-periodic, innovation part of a LP filter excitation signal.
 33. A method as defined in claim 32, wherein randomly generating the non-periodic, innovation part of the LP filter excitation signal comprises generating a random noise.
 34. A method as defined in claim 32, wherein randomly generating the non-periodic, innovation part of the LP filter excitation signal comprises randomly generating vector indexes of an innovation codebook.
 35. A method as defined in claim 32, wherein: the sound signal is a speech signal; determination of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and randomly generating the non-periodic, innovation part of the LP filter excitation signal further comprises: if the last correctly received frame is different from unvoiced, filtering the innovation part of the excitation signal through a high pass filter; and if the last correctly received frame is unvoiced, using only the innovation part of the excitation signal.
 36. A method as defined in claim 1, wherein: the sound signal is a speech signal; determination, in the encoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; conducting frame erasure concealment and decoder recovery comprises, when an onset frame is lost which is indicated by the presence of a voiced frame following frame erasure and an unvoiced frame before frame erasure, artificially reconstructing the lost onset by constructing a periodic part of an excitation signal as a low-pass filtered periodic train of pulses separated by a pitch period.
 37. A method as defined in claim 36, wherein conducting frame erasure concealment and decoder recovery further comprises constructing an innovation part of the excitation signal by means of normal decoding.
 38. A method as defined in claim 37, wherein constructing an innovation part of the excitation signal comprises randomly choosing entries of an innovation codebook.
 39. A method as defined in claim 36, wherein artificially reconstructing the lost onset frame comprises limiting a length of the artificially reconstructed onset so that at least one entire pitch period is constructed by the onset artificial reconstruction, said reconstruction being continued until the end of a current subframe.
 40. A method as defined in claim 39, wherein conducting frame erasure concealment and decoder recovery further comprises, after artificial reconstruction of the lost onset, resuming a regular CELP processing wherein the pitch period is a rounded average of decoded pitch periods of all subframes where the artificial onset reconstruction is used.
 41. A method as defined in claim 3, wherein conducting frame erasure concealment and decoder recovery comprises: controlling an energy of a synthesized sound signal produced by the decoder, controlling energy of the synthesized sound signal comprising scaling the synthesized sound signal to render an energy of said synthesized sound signal at the beginning of a first non erased frame received following frame erasure similar to an energy of said synthesized signal at the end of a last frame erased during said frame erasure; and converging the energy of the synthesized sound signal in the received first non erased frame to an energy corresponding to the received energy information parameter toward the end of said received first non erased frame while limiting an increase in energy.
 42. A method as defined in claim 3, wherein: the energy information parameter is not transmitted from the encoder to the decoder; and conducting frame erasure concealment and decoder recovery comprises, when a gain of a LP filter of a first non erased frame received following frame erasure is higher than a gain of a LP filter of a last frame erased during said frame erasure, adjusting the energy of an LP filter excitation signal produced in the decoder during the received first non erased frame to a gain of the LP filter of said received first non erased frame.
 43. A method as defined in claim 42, wherein: adjusting the energy of an LP filter excitation signal produced in the decoder during the received first non erased frame to a gain of the LP filter of said received first non erased frame comprises using the following relation: $E_{q} = E_{1}\frac{E_{LP0}}{E_{LP1}}$ where $E_{1}$ is the energy at the end of the current frame, $E_{LP0}$ is the energy of an impulse response of the LP filter to the last non erased frame received before the frame erasure, and $E_{LP1}$ is the energy of the impulse response of the LP filter to the received first non erased frame following frame erasure.
 44. A method as defined in claim 41, wherein: the sound signal is a speech signal; determination, in the encoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and when the first non erased frame received after a frame erasure is classified as ONSET, conducting frame erasure concealment and decoder recovery comprises limiting to a given value a gain used for scaling the synthesized sound signal.
 45. A method as defined in claim 41, wherein: the sound signal is a speech signal; determination, in the encoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and said method comprises making a gain used for scaling the synthesized sound signal at the beginning of the first non erased frame received after frame erasure equal to a gain used at the end of said received first non erased frame: during a transition from a voiced frame to an unvoiced frame, in the case of a last non erased frame received before frame erasure classified as voiced transition, voiced or onset and a first non erased frame received after frame erasure classified as unvoiced; and during a transition from a non-active speech period to an active speech period, when the last non erased frame received before frame erasure is encoded as comfort noise and the first non erased frame received after frame erasure is encoded as active speech.
 46. A method of concealing frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: determining, in the encoder, concealment/recovery parameters; and transmitting to the decoder concealment/recovery parameters determined in the encoder.
 47. A method as defined in claim 46, further comprising quantizing, in the encoder, the concealment/recovery parameters prior to transmitting said concealment/recovery parameters to the decoder.
 48. A method as defined in claim 46, wherein the concealment/recovery parameters are selected from the group consisting of: a signal classification parameter, an energy information parameter and a phase information parameter.
 49. A method as defined in claim 48, wherein determination of the phase information parameter comprises determining a position of a first glottal pulse in a frame of the encoded sound signal.
 50. A method as defined in claim 49, wherein determination of the phase information parameter further comprises encoding, in the encoder, the shape, sign and amplitude of the first glottal pulse and transmitting the encoded shape, sign and amplitude from the encoder to the decoder.
 51. A method as defined in claim 49, wherein determining the position of the first glottal pulse comprises: measuring the first glottal pulse as a sample of maximum amplitude within a pitch period; and quantizing the position of the sample of maximum amplitude within the pitch period.
 52. A method as defined in claim 46, wherein: the sound signal is a speech signal; and determination, in the encoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset.
 53. A method as defined in claim 52, wherein classifying the successive frames comprises classifying as unvoiced every frame which is an unvoiced frame, every frame without active speech, and every voiced offset frame having an end tending to be unvoiced.
 54. A method as defined in claim 52, wherein classifying the successive frames comprises classifying as unvoiced transition every unvoiced frame having an end with a possible voiced onset which is too short or not built well enough to be processed as a voiced frame.
 55. A method as defined in claim 52, wherein classifying the successive frames comprises classifying as voiced transition every voiced frame with relatively weak voiced characteristics, including voiced frames with rapidly changing characteristics and voiced offsets lasting the whole frame, wherein a frame classified as voiced transition follows only frames classified as voiced transition, voiced or onset.
 56. A method as defined in claim 52, wherein classifying the successive frames comprises classifying as voiced every voiced frame with stable characteristics, wherein a frame classified as voiced follows only frames classified as voiced transition, voiced or onset.
 57. A method as defined in claim 52, wherein classifying the successive frames comprises classifying as onset every voiced frame with stable characteristics following a frame classified as unvoiced or unvoiced transition.
 58. A method as defined in claim 52, comprising determining the classification of the successive frames of the encoded sound signal on the basis of at least a part of the following parameters: a normalized correlation parameter, a spectral tilt parameter, a signal-to-noise ratio parameter, a pitch stability parameter, a relative frame energy parameter, and a zero crossing parameter.
 59. A method as defined in claim 58, wherein determining the classification of the successive frames comprises: computing a figure of merit on the basis of the normalized correlation parameter, spectral tilt parameter, signal-to-noise ratio parameter, pitch stability parameter, relative frame energy parameter, and zero crossing parameter; and comparing the figure of merit to thresholds to determine the classification.
 60. A method as defined in claim 58, comprising calculating the normalized correlation parameter on the basis of a current weighted version of the speech signal and a past weighted version of said speech signal.
 61. A method as defined in claim 58, comprising estimating the spectral tilt parameter as a ratio between an energy concentrated in low frequencies and an energy concentrated in high frequencies.
 62. A method as defined in claim 58, comprising estimating the signal-to-noise ratio parameter as a ratio between an energy of a weighted version of the speech signal of a current frame and an energy of an error between said weighted version of the speech signal of the current frame and a weighted version of a synthesized speech signal of said current frame.
 63. A method as defined in claim 58, comprising computing the pitch stability parameter in response to open-loop pitch estimates for a first half of a current frame, a second half of the current frame and a look-ahead.
 64. A method as defined in claim 58, comprising computing the relative frame energy parameter as a difference between an energy of a current frame and a long-term average of an energy of active speech frames.
 65. A method as defined in claim 58, comprising determining the zero-crossing parameter as a number of times a sign of the speech signal changes from a first polarity to a second polarity.
 66. A method as defined in claim 58, comprising computing at least one of the normalized correlation parameter, spectral tilt parameter, signal-to-noise ratio parameter, pitch stability parameter, relative frame energy parameter, and zero crossing parameter using an available look-ahead to take into consideration the behavior of the speech signal in the following frame.
 67. A method as defined in claim 58, further comprising determining the classification of the successive frames of the encoded sound signal also on the basis of a voice activity detection flag.
 68. A method as defined in claim 48, wherein: the sound signal is a speech signal; determination, in the encoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and determining concealment/recovery parameters comprises calculating the energy information parameter in relation to a maximum of a signal energy for frames classified as voiced or onset, and calculating the energy information parameter in relation to an average energy per sample for other frames.
 69. A method as defined in claim 46, wherein determining, in the encoder, concealment/recovery parameters comprises computing a voicing information parameter.
 70. A method as defined in claim 68, wherein: the sound signal is a speech signal; determination, in the encoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal; said method comprises determining the classification of the successive frames of the encoded sound signal on the basis of a normalized correlation parameter; and computing the voicing information parameter comprises estimating said voicing information parameter on the basis of the normalized correlation.
 71. A method for the concealment of frame erasure caused by frames erased during transmission of a sound signal encoded under the form of signal-encoding parameters from an encoder to a decoder, comprising: determining, in the decoder, concealment/recovery parameters from the signal-encoding parameters; in the decoder, conducting erased frame concealment and decoder recovery in response to concealment/recovery parameters determined in the decoder.
 72. A method as defined in claim 71, wherein the concealment/recovery parameters are selected from the group consisting of: a signal classification parameter, an energy information parameter and a phase information parameter.
 73. A method as defined in claim 71, wherein: the sound signal is a speech signal; and determination, in the decoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset.
 74. A method as defined in claim 71, wherein determining, in the decoder, concealment/recovery parameters comprises computing a voicing information parameter.
 75. A method as defined in claim 71, wherein conducting frame erasure concealment and decoder recovery comprises: following receiving a non erased unvoiced frame after frame erasure, generating no periodic part of a LP filter excitation signal; following receiving, after frame erasure, of a non erased frame other than unvoiced, constructing a periodic part of the LP filter excitation signal by repeating a last pitch period of a previous frame.
 76. A method as defined in claim 75, wherein constructing the periodic part of the excitation signal comprises filtering the repeated last pitch period of the previous frame through a low-pass filter.
 77. A method as defined in claim 76, wherein: determining, in the decoder, concealment/recovery parameters comprises computing a voicing information parameter; the low-pass filter has a cut-off frequency; and constructing the periodic part of the LP filter excitation signal comprises dynamically adjusting the cut-off frequency in relation to the voicing information parameter.
 78. A method as defined in claim 71, wherein conducting frame erasure concealment and decoder recovery comprises randomly generating a non-periodic, innovation part of a LP filter excitation signal.
 79. A method as defined in claim 78, wherein randomly generating the non-periodic, innovation part of the LP filter excitation signal comprises generating a random noise.
 80. A method as defined in claim 78, wherein randomly generating the non-periodic, innovation part of the LP filter excitation signal comprises randomly generating vector indexes of an innovation codebook.
 81. A method as defined in claim 78, wherein: the sound signal is a speech signal; determination, in the decoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and randomly generating the non-periodic, innovation part of the LP filter excitation signal further comprises: if the last received non erased frame is different from unvoiced, filtering the innovation part of the LP filter excitation signal through a high pass filter; and if the last received non erased frame is unvoiced, using only the innovation part of the LP filter excitation signal.
 82. A method as defined in claim 78, wherein: the sound signal is a speech signal; determination, in the decoder, of concealment/recovery parameters comprises classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; conducting frame erasure concealment and decoder recovery comprises, when an onset frame is lost which is indicated by the presence of a voiced frame following frame erasure and an unvoiced frame before frame erasure, artificially reconstructing the lost onset by constructing a periodic part of an excitation signal as a low-pass filtered periodic train of pulses separated by a pitch period.
 83. A method as defined in claim 82, wherein conducting frame erasure concealment and decoder recovery further comprises constructing an innovation part of the LP filter excitation signal by means of normal decoding.
 84. A method as defined in claim 83, wherein constructing an innovation part of the LP filter excitation signal comprises randomly choosing entries of an innovation codebook.
 85. A method as defined in claim 82, wherein artificially reconstructing the lost onset comprises limiting a length of the artificially reconstructed onset so that at least one entire pitch period is constructed by the onset artificial reconstruction, said reconstruction being continued until the end of a current subframe.
 86. A method as defined in claim 85, wherein conducting frame erasure concealment and decoder recovery further comprises, after artificial reconstruction of the lost onset, resuming a regular CELP processing wherein the pitch period is a rounded average of decoded pitch periods of all subframes where the artificial onset reconstruction is used.
 87. A method as defined in claim 72, wherein: the energy information parameter is not transmitted from the encoder to the decoder; and conducting frame erasure concealment and decoder recovery comprises, when a gain of a LP filter of a first non erased frame received following frame erasure is higher than a gain of a LP filter of a last frame erased during said frame erasure, adjusting the energy of an LP filter excitation signal produced in the decoder during the received first non erased frame to a gain of the LP filter of said received first non erased frame using the following relation: $E_{q} = E_{1}\frac{E_{LP0}}{E_{LP1}}$ where $E_{1}$ is the energy at the end of the current frame, $E_{LP0}$ is the energy of an impulse response of the LP filter to the last non erased frame received before the frame erasure, and $E_{LP1}$ is the energy of the impulse response of the LP filter to the received first non erased frame following frame erasure.
 88. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: means for determining, in the encoder, concealment/recovery parameters; means for transmitting to the decoder concealment/recovery parameters determined in the encoder; and in the decoder, means for conducting frame erasure concealment and decoder recovery in response to received concealment/recovery parameters determined by the determining means.
 89. A device as defined in claim 88, further comprising means for quantizing, in the encoder, the concealment/recovery parameters prior to transmitting said concealment/recovery parameters to the decoder.
 90. A device as defined in claim 88, wherein the concealment/recovery parameters are selected from the group consisting of: a signal classification parameter, an energy information parameter and a phase information parameter.
 91. A device as defined in claim 90, wherein the means for determining the phase information parameter comprises means for determining the position of a first glottal pulse in a frame of the encoded sound signal.
 92. A device as defined in claim 88, wherein the means for conducting frame erasure concealment and decoder recovery comprises means for conducting decoder recovery in response to a determined position of a first glottal pulse after at least one lost voice onset.
 93. A device as defined in claim 88, wherein the means for conducting frame erasure concealment and decoder recovery comprises means for constructing, when at least one onset frame is lost, a periodic excitation part artificially as a low-pass filtered periodic train of pulses separated by a pitch period.
 94. A device as defined in claim 93, wherein: the device comprises means for quantizing the position of the first glottal pulse prior to transmission of said position of the first glottal pulse to the decoder; and the means for constructing a periodic excitation part comprises means for realizing the low-pass filtered periodic train of pulses by: centering a first impulse response of a low-pass filter on the quantized position of the first glottal pulse with respect to the beginning of a frame; and placing remaining impulse responses of the low-pass filter each with a distance corresponding to an average pitch value from the preceding impulse response up to the end of a last subframe affected by the artificial construction.
 95. A device as defined in claim 91, wherein the means for determining the phase information parameter further comprises means for encoding, in the encoder, the shape, sign and amplitude of the first glottal pulse and means for transmitting the encoded shape, sign and amplitude from the encoder to the decoder.
 96. A device as defined in claim 91, wherein the means for determining the position of the first glottal pulse comprises: means for measuring the first glottal pulse as a sample of maximum amplitude within a pitch period; and means for quantizing the position of the sample of maximum amplitude within the pitch period.
 97. A device as defined in claim 88, wherein: the sound signal is a speech signal; and the means for determining, in the encoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset.
 98. A device as defined in claim 97, wherein the means for classifying the successive frames comprises means for classifying as unvoiced every frame which is an unvoiced frame, every frame without active speech, and every voiced offset frame having an end tending to be unvoiced.
 99. A device as defined in claim 97, wherein the means for classifying the successive frames comprises means for classifying as unvoiced transition every unvoiced frame having an end with a possible voiced onset which is too short or not built well enough to be processed as a voiced frame.
 100. A device as defined in claim 97, wherein the means for classifying the successive frames comprises means for classifying as voiced transition every voiced frame with relatively weak voiced characteristics, including voiced frames with rapidly changing characteristics and voiced offsets lasting the whole frame, wherein a frame classified as voiced transition follows only frames classified as voiced transition, voiced or onset.
 101. A device as defined in claim 97, wherein the means for classifying the successive frames comprises means for classifying as voiced every voiced frame with stable characteristics, wherein a frame classified as voiced follows only frames classified as voiced transition, voiced or onset.
 102. A device as defined in claim 97, wherein the means for classifying the successive frames comprises means for classifying as onset every voiced frame with stable characteristics following a frame classified as unvoiced or unvoiced transition.
 103. A device as defined in claim 97, comprising means for determining the classification of the successive frames of the encoded sound signal on the basis of at least a part of the following parameters: a normalized correlation parameter, a spectral tilt parameter, a signal-to-noise ratio parameter, a pitch stability parameter, a relative frame energy parameter, and a zero crossing parameter.
 104. A device as defined in claim 103, wherein the means for determining the classification of the successive frames comprises: means for computing a figure of merit on the basis of the normalized correlation parameter, spectral tilt parameter, signal-to-noise ratio parameter, pitch stability parameter, relative frame energy parameter, and zero crossing parameter; and means for comparing the figure of merit to thresholds to determine the classification.
 105. A device as defined in claim 103, comprising means for calculating the normalized correlation parameter on the basis of a current weighted version of the speech signal and a past weighted version of said speech signal.
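For illustration only, and not as part of the claims, a normalized correlation between a current segment of the weighted speech and the segment one pitch lag earlier might be computed as in the Python sketch below. The function name, the segment length and the choice of the pitch lag as the delay are assumptions of this sketch; claim 105 only requires a current and a past weighted version of the speech signal.

```python
import math

def normalized_correlation(sw, start, length, pitch_lag):
    """Correlate sw[start:start+length] with the same span delayed by pitch_lag.

    sw is the weighted speech signal; start must be at least pitch_lag.
    Returns a value in [-1, 1], or 0.0 when either segment has no energy.
    """
    cur = sw[start:start + length]
    past = sw[start - pitch_lag:start - pitch_lag + length]
    num = sum(a * b for a, b in zip(cur, past))
    den = math.sqrt(sum(a * a for a in cur) * sum(b * b for b in past))
    return num / den if den > 0.0 else 0.0

# A sinusoid with a 20-sample period correlates almost perfectly at lag 20.
sw = [math.sin(2.0 * math.pi * n / 20.0) for n in range(200)]
r = normalized_correlation(sw, start=100, length=80, pitch_lag=20)
```

A value close to 1 indicates strongly periodic (voiced) content at the tested lag.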
 106. A device as defined in claim 103, comprising means for estimating the spectral tilt parameter as a ratio between an energy concentrated in low frequencies and an energy concentrated in high frequencies.
 107. A device as defined in claim 103, comprising means for estimating the signal-to-noise ratio parameter as a ratio between an energy of a weighted version of the speech signal of a current frame and an energy of an error between said weighted version of the speech signal of the current frame and a weighted version of a synthesized speech signal of said current frame.
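For illustration only, and not as part of the claims, the ratio recited in claim 107 might be computed as in the following Python sketch. The function name and the handling of a zero-energy error (returning infinity) are assumptions of this sketch.

```python
def snr_parameter(sw, sw_synth):
    """Energy of the weighted speech divided by the energy of the error.

    sw       : weighted speech samples of the current frame
    sw_synth : weighted synthesized speech samples of the same frame
    """
    signal_energy = sum(s * s for s in sw)
    error_energy = sum((s - y) ** 2 for s, y in zip(sw, sw_synth))
    if error_energy == 0.0:
        return float("inf")  # assumption: a perfect match gives an unbounded ratio
    return signal_energy / error_energy

ratio = snr_parameter([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 3.0])  # 30.0 / 1.0
```

An implementation may prefer to express the ratio in decibels; the claim only states that it is a ratio of the two energies.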
 108. A device as defined in claim 103, comprising means for computing the pitch stability parameter in response to open-loop pitch estimates for a first half of a current frame, a second half of the current frame and a look-ahead.
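For illustration only, and not as part of the claims, one plausible pitch stability measure sums the absolute differences between consecutive open-loop pitch estimates. The formula and the function name are assumptions of this Python sketch; claim 108 only states that the parameter is computed in response to the three estimates.

```python
def pitch_stability(p_first_half, p_second_half, p_lookahead):
    """Sum of absolute changes between consecutive open-loop pitch estimates.

    Smaller values indicate a more stable pitch contour (assumed measure).
    """
    return abs(p_second_half - p_first_half) + abs(p_lookahead - p_second_half)

stable = pitch_stability(50, 50, 50)    # 0
drifting = pitch_stability(40, 42, 41)  # 3
```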
 109. A device as defined in claim 103, comprising means for computing the relative frame energy parameter as a difference between an energy of a current frame and a long-term average of an energy of active speech frames.
 110. A device as defined in claim 103, comprising means for determining the zero-crossing parameter as a number of times a sign of the speech signal changes from a first polarity to a second polarity.
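For illustration only, and not as part of the claims, the zero-crossing count of claim 110 might be obtained as in the Python sketch below. Taking the first polarity as non-negative and the second as negative is an assumption of this sketch, as is the function name.

```python
def zero_crossings(x):
    """Count transitions from a non-negative sample to a negative sample."""
    count = 0
    for prev, cur in zip(x, x[1:]):
        if prev >= 0.0 and cur < 0.0:
            count += 1
    return count

n = zero_crossings([1.0, -1.0, 1.0, -1.0, 1.0])  # two positive-to-negative changes
```

Only one direction of sign change is counted, so a full oscillation contributes a single crossing.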
 111. A device as defined in claim 103, comprising means for computing at least one of the normalized correlation parameter, spectral tilt parameter, signal-to-noise ratio parameter, pitch stability parameter, relative frame energy parameter, and zero crossing parameter using an available look-ahead to take into consideration the behavior of the speech signal in the following frame.
 112. A device as defined in claim 103, further comprising means for determining the classification of the successive frames of the encoded sound signal also on the basis of a voice activity detection flag.
 113. A device as defined in claim 90, wherein: the sound signal is a speech signal; the means for determining, in the encoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and the means for determining concealment/recovery parameters comprises means for calculating the energy information parameter in relation to a maximum of a signal energy for frames classified as voiced or onset, and means for calculating the energy information parameter in relation to an average energy per sample for other frames.
 114. A device as defined in claim 88, wherein the means for determining, in the encoder, concealment/recovery parameters comprises means for computing a voicing information parameter.
 115. A device as defined in claim 114, wherein: the sound signal is a speech signal; the means for determining, in the encoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal; said device comprises means for determining the classification of the successive frames of the encoded sound signal on the basis of a normalized correlation parameter; and the means for computing the voicing information parameter comprises means for estimating said voicing information parameter on the basis of the normalized correlation.
 116. A device as defined in claim 88, wherein the means for conducting frame erasure concealment and decoder recovery comprises: following receiving a non erased unvoiced frame after frame erasure, means for generating no periodic part of a LP filter excitation signal; following receiving, after frame erasure, of a non erased frame other than unvoiced, means for constructing a periodic part of the LP filter excitation signal by repeating a last pitch period of a previous frame.
 117. A device as defined in claim 116, wherein the means for constructing the periodic part of the LP filter excitation signal comprises a low-pass filter for filtering the repeated last pitch period of the previous frame.
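For illustration only, and not as part of the claims, the periodic part of claims 116 and 117 might be built as in the Python sketch below: the last pitch period of the previous excitation is repeated to fill the frame and then smoothed. The 3-tap filter with taps (0.25, 0.5, 0.25), the edge handling and both function names are assumptions of this sketch; the claims do not fix the low-pass filter.

```python
def repeat_last_pitch_period(prev_exc, pitch, frame_len):
    """Fill frame_len samples by cyclically repeating the last pitch samples."""
    last_period = prev_exc[-pitch:]
    return [last_period[i % pitch] for i in range(frame_len)]

def smooth_lowpass(x):
    """Hypothetical 3-tap low-pass (0.25, 0.5, 0.25); edge samples are repeated."""
    n = len(x)
    out = []
    for i in range(n):
        left = x[max(i - 1, 0)]
        right = x[min(i + 1, n - 1)]
        out.append(0.25 * left + 0.5 * x[i] + 0.25 * right)
    return out

periodic = repeat_last_pitch_period([0.0, 0.0, 1.0, 2.0, 3.0], pitch=3, frame_len=7)
filtered = smooth_lowpass(periodic)
```

Here `periodic` is `[1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0]`, that is, the last three samples of the previous excitation repeated across the frame.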
 118. A device as defined in claim 117, wherein: the means for determining concealment/recovery parameters comprises means for computing a voicing information parameter; the low-pass filter has a cut-off frequency; and the means for constructing the periodic part of the excitation signal comprises means for dynamically adjusting the cut-off frequency in relation to the voicing information parameter.
 119. A device as defined in claim 88, wherein the means for conducting frame erasure concealment and decoder recovery comprises means for randomly generating a non-periodic, innovation part of a LP filter excitation signal.
 120. A device as defined in claim 119, wherein the means for randomly generating the non-periodic, innovation part of the LP filter excitation signal comprises means for generating a random noise.
 121. A device as defined in claim 119, wherein the means for randomly generating the non-periodic, innovation part of the LP filter excitation signal comprises means for randomly generating vector indexes of an innovation codebook.
 122. A device as defined in claim 119, wherein: the sound signal is a speech signal; the means for determining concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and the means for randomly generating the non-periodic, innovation part of the LP filter excitation signal further comprises: if the last correctly received frame is different from unvoiced, a high-pass filter for filtering the innovation part of the excitation signal; and if the last correctly received frame is unvoiced, means for using only the innovation part of the excitation signal.
 123. A device as defined in claim 88, wherein: the sound signal is a speech signal; the means for determining, in the encoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; the means for conducting frame erasure concealment and decoder recovery comprises, when an onset frame is lost which is indicated by the presence of a voiced frame following frame erasure and an unvoiced frame before frame erasure, means for artificially reconstructing the lost onset by constructing a periodic part of an excitation signal as a low-pass filtered periodic train of pulses separated by a pitch period.
 124. A device as defined in claim 123, wherein the means for conducting frame erasure concealment and decoder recovery further comprises means for constructing an innovation part of the excitation signal by means of normal decoding.
 125. A device as defined in claim 124, wherein the means for constructing an innovation part of the excitation signal comprises means for randomly choosing entries of an innovation codebook.
 126. A device as defined in claim 123, wherein the means for artificially reconstructing the lost onset comprises means for limiting a length of the artificially reconstructed onset so that at least one entire pitch period is constructed by the onset artificial reconstruction, said reconstruction being continued until the end of a current subframe.
 127. A device as defined in claim 126, wherein the means for conducting frame erasure concealment and decoder recovery further comprises, after artificial reconstruction of the lost onset, means for resuming a regular CELP processing wherein the pitch period is a rounded average of decoded pitch periods of all subframes where the artificial onset reconstruction is used.
 128. A device as defined in claim 90, wherein the means for conducting frame erasure concealment and decoder recovery comprises: means for controlling an energy of a synthesized sound signal produced by the decoder, the means for controlling energy of the synthesized sound signal comprising means for scaling the synthesized sound signal to render an energy of said synthesized sound signal at the beginning of a first non erased frame received following frame erasure similar to an energy of said synthesized signal at the end of a last frame erased during said frame erasure; and means for converging the energy of the synthesized sound signal in the received first non erased frame to an energy corresponding to the received energy information parameter toward the end of said received first non erased frame while limiting an increase in energy.
 129. A device as defined in claim 90, wherein: the energy information parameter is not transmitted from the encoder to the decoder; and the means for conducting frame erasure concealment and decoder recovery comprises, when a gain of a LP filter of a first non erased frame received following frame erasure is higher than a gain of a LP filter of a last frame erased during said frame erasure, means for adjusting the energy of an LP filter excitation signal produced in the decoder during the received first non erased frame to a gain of the LP filter of said received first non erased frame.
 130. A device as defined in claim 129, wherein: the means for adjusting the energy of an LP filter excitation signal produced in the decoder during the received first non erased frame to a gain of the LP filter of said received first non erased frame comprises means for using the following relation: $E_{q} = E_{1}\frac{E_{LP0}}{E_{LP1}}$ where $E_{1}$ is the energy at the end of the current frame, $E_{LP0}$ is the energy of an impulse response of the LP filter to the last non erased frame received before the frame erasure, and $E_{LP1}$ is the energy of the impulse response of the LP filter to the received first non erased frame following frame erasure.
 131. A device as defined in claim 128, wherein: the sound signal is a speech signal; the means for determining, in the encoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and when the first non erased frame received after a frame erasure is classified as ONSET, the means for conducting frame erasure concealment and decoder recovery comprises means for limiting to a given value a gain used for scaling the synthesized sound signal.
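For illustration only, and not as part of the claims, the relation $E_{q} = E_{1}\frac{E_{LP0}}{E_{LP1}}$ recited in claim 130 might be evaluated as in the Python sketch below. The representation of the LP filter as coefficients `[1, a1, ..., ap]` of A(z) with synthesis filter 1/A(z), the 64-sample truncation of the impulse response, and the function names are assumptions of this sketch.

```python
def lp_impulse_response_energy(a, n=64):
    """Energy of the first n samples of the impulse response of 1/A(z).

    a = [1, a1, ..., ap] holds the coefficients of A(z) (assumed convention).
    """
    h = []
    for i in range(n):
        acc = 1.0 if i == 0 else 0.0  # unit impulse input
        for k in range(1, len(a)):
            if i - k >= 0:
                acc -= a[k] * h[i - k]
        h.append(acc)
    return sum(v * v for v in h)

def adjusted_energy(e1, a_before_erasure, a_after_erasure):
    """E_q = E_1 * E_LP0 / E_LP1, following the relation recited in claim 130."""
    e_lp0 = lp_impulse_response_energy(a_before_erasure)
    e_lp1 = lp_impulse_response_energy(a_after_erasure)
    return e1 * e_lp0 / e_lp1

# Flat filter before the erasure, one-pole filter with a higher gain after it.
e_q = adjusted_energy(2.0, [1.0], [1.0, -0.5])
```

In this example the LP filter of the first good frame has the larger impulse response energy, so the target excitation energy is reduced from 2.0 to about 1.5.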
 132. A device as defined in claim 128, wherein: the sound signal is a speech signal; the means for determining, in the encoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and said device comprises means for making a gain used for scaling the synthesized sound signal at the beginning of the first non erased frame received after frame erasure equal to a gain used at the end of said received first non erased frame: during a transition from a voiced frame to an unvoiced frame, in the case of a last non erased frame received before frame erasure classified as voiced transition, voiced or onset and a first non erased frame received after frame erasure classified as unvoiced; and during a transition from a non-active speech period to an active speech period, when the last non erased frame received before frame erasure is encoded as comfort noise and the first non erased frame received after frame erasure is encoded as active speech.
 133. A device for conducting concealment of frame erasure caused by frames of an encoded sound signal erased during transmission from an encoder to a decoder, comprising: means for determining, in the encoder, concealment/recovery parameters; and means for transmitting to the decoder concealment/recovery parameters determined in the encoder.
 134. A device as defined in claim 133, further comprising means for quantizing, in the encoder, the concealment/recovery parameters prior to transmitting said concealment/recovery parameters to the decoder.
 135. A device as defined in claim 133, wherein the concealment/recovery parameters are selected from the group consisting of: a signal classification parameter, an energy information parameter and a phase information parameter.
 136. A device as defined in claim 135, wherein the means for determining the phase information parameter comprises means for determining the position of a first glottal pulse in a frame of the encoded sound signal.
 137. A device as defined in claim 136, wherein the means for determining the phase information parameter further comprises means for encoding, in the encoder, the shape, sign and amplitude of the first glottal pulse and means for transmitting the encoded shape, sign and amplitude from the encoder to the decoder.
 138. A device as defined in claim 136, wherein the means for determining the position of the first glottal pulse comprises: means for measuring the first glottal pulse as a sample of maximum amplitude within a pitch period; and means for quantizing the position of the sample of maximum amplitude within the pitch period.
 139. A device as defined in claim 133, wherein: the sound signal is a speech signal; and the means for determining, in the encoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset.
 140. A device as defined in claim 139, wherein the means for classifying the successive frames comprises means for classifying as unvoiced every frame which is an unvoiced frame, every frame without active speech, and every voiced offset frame having an end tending to be unvoiced.
 141. A device as defined in claim 139, wherein the means for classifying the successive frames comprises means for classifying as unvoiced transition every unvoiced frame having an end with a possible voiced onset which is too short or not built well enough to be processed as a voiced frame.
 142. A device as defined in claim 139, wherein the means for classifying the successive frames comprises means for classifying as voiced transition every voiced frame with relatively weak voiced characteristics, including voiced frames with rapidly changing characteristics and voiced offsets lasting the whole frame, wherein a frame classified as voiced transition follows only frames classified as voiced transition, voiced or onset.
 143. A device as defined in claim 139, wherein the means for classifying the successive frames comprises means for classifying as voiced every voiced frame with stable characteristics, wherein a frame classified as voiced follows only frames classified as voiced transition, voiced or onset.
 144. A device as defined in claim 139, wherein the means for classifying the successive frames comprises means for classifying as onset every voiced frame with stable characteristics following a frame classified as unvoiced or unvoiced transition.
 145. A device as defined in claim 139, comprising means for determining the classification of the successive frames of the encoded sound signal on the basis of at least a part of the following parameters: a normalized correlation parameter, a spectral tilt parameter, a signal-to-noise ratio parameter, a pitch stability parameter, a relative frame energy parameter, and a zero crossing parameter.
 146. A device as defined in claim 145, wherein the means for determining the classification of the successive frames comprises: means for computing a figure of merit on the basis of the normalized correlation parameter, spectral tilt parameter, signal-to-noise ratio parameter, pitch stability parameter, relative frame energy parameter, and zero crossing parameter; and means for comparing the figure of merit to thresholds to determine the classification.
 147. A device as defined in claim 145, comprising means for calculating the normalized correlation parameter on the basis of a current weighted version of the speech signal and a past weighted version of said speech signal.
 148. A device as defined in claim 145, comprising means for estimating the spectral tilt parameter as a ratio between an energy concentrated in low frequencies and an energy concentrated in high frequencies.
 149. A device as defined in claim 145, comprising means for estimating the signal-to-noise ratio parameter as a ratio between an energy of a weighted version of the speech signal of a current frame and an energy of an error between said weighted version of the speech signal of the current frame and a weighted version of a synthesized speech signal of said current frame.
 150. A device as defined in claim 145, comprising means for computing the pitch stability parameter in response to open-loop pitch estimates for a first half of a current frame, a second half of the current frame and a look-ahead.
 151. A device as defined in claim 145, comprising means for computing the relative frame energy parameter as a difference between an energy of a current frame and a long-term average of an energy of active speech frames.
 152. A device as defined in claim 145, comprising means for determining the zero-crossing parameter as a number of times a sign of the speech signal changes from a first polarity to a second polarity.
 153. A device as defined in claim 145, comprising means for computing at least one of the normalized correlation parameter, spectral tilt parameter, signal-to-noise ratio parameter, pitch stability parameter, relative frame energy parameter, and zero crossing parameter using an available look-ahead to take into consideration the behavior of the speech signal in the following frame.
 154. A device as defined in claim 145, further comprising means for determining the classification of the successive frames of the encoded sound signal also on the basis of a voice activity detection flag.
 155. A device as defined in claim 135, wherein: the sound signal is a speech signal; the means for determining, in the encoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and the means for determining concealment/recovery parameters comprises means for calculating the energy information parameter in relation to a maximum of a signal energy for frames classified as voiced or onset, and means for calculating the energy information parameter in relation to an average energy per sample for other frames.
 156. A device as defined in claim 133, wherein the means for determining, in the encoder, concealment/recovery parameters comprises means for computing a voicing information parameter.
 157. A device as defined in claim 156, wherein: the sound signal is a speech signal; the means for determining, in the encoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal; said device comprises means for determining the classification of the successive frames of the encoded sound signal on the basis of a normalized correlation parameter; and the means for computing the voicing information parameter comprises means for estimating said voicing information parameter on the basis of the normalized correlation.
 158. A device for the concealment of frame erasure caused by frames erased during transmission of a sound signal encoded under the form of signal-encoding parameters from an encoder to a decoder, comprising: means for determining, in the decoder, concealment/recovery parameters from the signal-encoding parameters; in the decoder, means for conducting erased frame concealment and decoder recovery in response to concealment/recovery parameters determined by the determining means.
 159. A device as defined in claim 158, wherein the concealment/recovery parameters are selected from the group consisting of: a signal classification parameter, an energy information parameter and a phase information parameter.
 160. A device as defined in claim 158, wherein: the sound signal is a speech signal; and the means for determining, in the decoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset.
 161. A device as defined in claim 158, wherein the means for determining, in the decoder, concealment/recovery parameters comprises means for computing a voicing information parameter.
 162. A device as defined in claim 158, wherein the means for conducting frame erasure concealment and decoder recovery comprises: following receiving a non erased unvoiced frame after frame erasure, means for generating no periodic part of a LP filter excitation signal; following receiving, after frame erasure, of a non erased frame other than unvoiced, means for constructing a periodic part of the LP filter excitation signal by repeating a last pitch period of a previous frame.
 163. A device as defined in claim 162, wherein the means for constructing the periodic part of the excitation signal comprises a low-pass filter for filtering the repeated last pitch period of the previous frame.
 164. A device as defined in claim 163, wherein: the means for determining, in the decoder, concealment/recovery parameters comprises means for computing a voicing information parameter; the low-pass filter has a cut-off frequency; and the means for constructing the periodic part of the LP filter excitation signal comprises means for dynamically adjusting the cut-off frequency in relation to the voicing information parameter.
 165. A device as defined in claim 158, wherein the means for conducting frame erasure concealment and decoder recovery comprises means for randomly generating a non-periodic, innovation part of a LP filter excitation signal.
 166. A device as defined in claim 165, wherein the means for randomly generating the non-periodic, innovation part of the LP filter excitation signal comprises means for generating a random noise.
 167. A device as defined in claim 165, wherein the means for randomly generating the non-periodic, innovation part of the LP filter excitation signal comprises means for randomly generating vector indexes of an innovation codebook.
 168. A device as defined in claim 165, wherein: the sound signal is a speech signal; the means for determining, in the decoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; and the means for randomly generating the non-periodic, innovation part of the LP filter excitation signal further comprises: if the last received non erased frame is different from unvoiced, a high-pass filter for filtering the innovation part of the LP filter excitation signal; and if the last received non erased frame is unvoiced, means for using only the innovation part of the LP filter excitation signal.
 169. A device as defined in claim 165, wherein: the sound signal is a speech signal; the means for determining, in the decoder, concealment/recovery parameters comprises means for classifying successive frames of the encoded sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; the means for conducting frame erasure concealment and decoder recovery comprises, when an onset frame is lost which is indicated by the presence of a voiced frame following frame erasure and an unvoiced frame before frame erasure, means for artificially reconstructing the lost onset by constructing a periodic part of an excitation signal as a low-pass filtered periodic train of pulses separated by a pitch period.
 170. A device as defined in claim 169, wherein the means for conducting frame erasure concealment and decoder recovery further comprises means for constructing an innovation part of the LP filter excitation signal by means of normal decoding.
 171. A device as defined in claim 170, wherein the means for constructing an innovation part of the LP filter excitation signal comprises means for randomly choosing entries of an innovation codebook.
 172. A device as defined in claim 169, wherein the means for artificially reconstructing the lost onset comprises means for limiting a length of the artificially reconstructed onset so that at least one entire pitch period is constructed by the onset artificial reconstruction, said reconstruction being continued until the end of a current subframe.
 173. A device as defined in claim 172, wherein the means for conducting frame erasure concealment and decoder recovery further comprises, after artificial reconstruction of the lost onset, means for resuming a regular CELP processing wherein the pitch period is a rounded average of decoded pitch periods of all subframes where the artificial onset reconstruction is used.
 174. A device as defined in claim 159, wherein: the energy information parameter is not transmitted from the encoder to the decoder; and the means for conducting frame erasure concealment and decoder recovery comprises, when a gain of a LP filter of a first non erased frame received following frame erasure is higher than a gain of a LP filter of a last frame erased during said frame erasure, means for adjusting the energy of an LP filter excitation signal produced in the decoder during the received first non erased frame to a gain of the LP filter of said received first non erased frame using the following relation: $E_{q} = E_{1}\frac{E_{LP0}}{E_{LP1}}$ where $E_{1}$ is the energy at the end of the current frame, $E_{LP0}$ is the energy of an impulse response of the LP filter to the last non erased frame received before the frame erasure, and $E_{LP1}$ is the energy of the impulse response of the LP filter to the received first non erased frame following frame erasure.
 175. A system for encoding and decoding a sound signal, comprising: a sound signal encoder responsive to the sound signal for producing a set of signal-encoding parameters; means for transmitting the signal-encoding parameters to a decoder; said decoder for synthesizing the sound signal in response to the signal-encoding parameters; and a device as recited in claim 88, for concealing frame erasure caused by frames of the encoded sound signal erased during transmission from the encoder to the decoder.
 176. A decoder for decoding an encoded sound signal comprising: means responsive to the encoded sound signal for recovering from said encoded sound signal a set of signal-encoding parameters; means for synthesizing the sound signal in response to the signal-encoding parameters; and a device as recited in claim 158, for concealing frame erasure caused by frames of the encoded sound signal erased during transmission from an encoder to the decoder.
 177. An encoder for encoding a sound signal comprising: means responsive to the sound signal for producing a set of signal-encoding parameters; means for transmitting the set of signal-encoding parameters to a decoder responsive to the signal-encoding parameters for recovering the sound signal; and a device as recited in claim 133, for conducting concealment of frame erasure caused by frames erased during transmission of the signal-encoding parameters from the encoder to the decoder.